PhD in Statistics and Applied Probability from the National University of Singapore. Possess a strong mathematical, statistical and technical background in data science. Avid reader and data enthusiast. Keen to learn about new advances in technology and keep my knowledge up to date.
Supervisor : Dr. Li Jialiang (DSAP, NUS)
MS Thesis Supervisors : Prof. Somdatta Sinha (IISER, Mohali) and Dr. Samsiddhi Bhattacharjee (NIBMG, Kalyani)
At Northern Trust, I am building a credit risk model. The work involves large datasets with several loan-level and macroeconomic characteristics. Extensive data cleaning is performed to ensure data quality, after which an ML model is applied to classify defaulting versus non-defaulting clients.
At Qzense labs, I built model pipelines for various predictive jobs. Each pipeline covers fetching data from the database, preprocessing the data, training, validating and testing the model, and deploying it. It is an AutoML pipeline created in AWS. I also worked on time series modelling to track fruit ripeness.
Worked on several projects, such as deriving statistical inference for Decision Curve Analysis (one of the accuracy measures for classification) with application to cataract diagnosis. Also worked on modern supervised machine learning and deep learning methods with end-to-end analysis.
Worked extensively on an HIV dataset, applying machine learning methods for dimensionality reduction such as principal component analysis, multi-dimensional scaling and others.
Worked in the area of statistical genomics. Completed two projects comprising machine learning methods for finding associations in GWAS (Genome-Wide Association Studies) data. Used Empirical Bayes analysis for the project.
In this project, I worked on financial credit risk data. The dataset was large, with millions of records of loan-level variables and several macroeconomic variables. The task was to identify high-risk (defaulting) clients as accurately as possible. The biggest challenges were missing-value imputation and the low count of defaults relative to non-defaults. The data was checked for quality before the models were applied. The key analyses performed were single-factor analysis, multiple-factor analysis, and the calculation of accuracy measures such as GINI and RMSE. SAS was used as the programming tool.
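As an illustration of the GINI measure mentioned above, here is a minimal sketch in Python (the project itself used SAS). It assumes that GINI refers to the usual credit-scoring Gini coefficient, defined as 2 × AUC − 1. The function names `auc` and `gini` and the toy data are hypothetical and not taken from the project.

```python
def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts, over all (defaulter, non-defaulter) pairs, how often the
    defaulter receives the higher score; ties count as half.
    This is O(n_pos * n_neg), so it is only meant for small examples.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one default and one non-default")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def gini(y_true, scores):
    """Gini coefficient under the assumed definition 2 * AUC - 1."""
    return 2.0 * auc(y_true, scores) - 1.0


# Toy data (hypothetical): 1 = default, 0 = non-default.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.1, 0.4, 0.6]
print("AUC :", auc(y_true, scores))
print("Gini:", gini(y_true, scores))
```

With heavily imbalanced defaults, a rank-based measure like this is often preferred to raw accuracy because it does not depend on a single classification cutoff.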
In this project we classified fish by freshness index, using images of the gills as the dataset. We built transfer learning models such as ResNet, as well as our in-house-trained CNN model, to classify fish into the 6 classes of the freshness index. I also used an ML model for the same task. The challenging part of this project was cleaning and preprocessing the data, as the images were taken from a phone camera. Some of the solutions we tried were cropping the image to extract the gills, calculating the mucus percentage by slicing the image into several parts and then stacking the mucus-classifying models to get the freshness index, and using dimensionality reduction for the ML model on images.
At Qzense labs, I am building model pipelines for various predictive jobs. Each pipeline covers fetching data from the database, preprocessing the data, training, validating and testing the model, and deploying it. It is an AutoML pipeline created in AWS. I am also working on time series modelling to track fruit ripeness.
Supervisor : Dr. Li Jialiang (DSAP, NUS)
Machine learning models are used for medical data analysis to reduce human effort and to understand patterns of disease propagation. When the data is unstructured, shallow machine learning methods may not be a feasible option. Hence, state-of-the-art deep learning neural networks such as the multilayer perceptron (MLP) and the convolutional neural network (CNN) should be incorporated in medical diagnosis and prognosis for better results. For a binary outcome variable, accuracy measures such as sensitivity, specificity and the area under the receiver operating characteristic curve are used to assess model efficacy, but they fail to account for the utility of the model itself in the analysis. Decision curve analysis is used in medical studies to address this problem. In this project, widely used supervised learning methods (shallow and deep) are reviewed and demonstrated using real clinical data. We also provide R code to illustrate how to apply machine learning and deep learning methods. This project will help medical decision makers understand different classification methods and how to use them in real-world scenarios.
Supervisor : Dr. Li Jialiang (DSAP, NUS)
Statistical learning methods are widely used in the medical literature for diagnosis or prediction. Conventional accuracy assessment via sensitivity, specificity, and ROC curves does not fully account for the clinical utility of a specific model. Decision curve analysis (DCA) is a novel complement, as it incorporates a clinical judgment of the relative value of benefits (treating a true positive case) and harms (treating a false positive case) associated with prediction models. The preference of a patient or a policy-maker is formulated statistically as the underlying threshold probability, above which the patient would choose to be treated. Net benefit, which places benefits and harms on the same scale, is then calculated for each possible threshold probability. We consider the inference problems for DCA in this paper. An interval estimation procedure and inference methodology are provided after we derive the relevant asymptotic properties. Our formulation can accommodate classification problems with multiple categories. We carry out numerical studies to assess the performance of the proposed methods. An eye disease dataset is analyzed to illustrate our proposals. See Project
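The net benefit described above can be sketched in Python for the binary case. This is a minimal illustration that assumes the standard decision-curve definition, net benefit = TP/n − (FP/n) × p_t / (1 − p_t), where p_t is the threshold probability. It is not the inference procedure developed in the paper; the function names and the toy data are hypothetical.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone whose predicted risk >= threshold.

    True positives count as benefit; false positives are penalised by the
    odds of the threshold probability, so both sit on the same scale.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    odds = threshold / (1.0 - threshold)
    return tp / n - (fp / n) * odds


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: treat every patient regardless of the model."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    prevalence = sum(y_true) / len(y_true)
    odds = threshold / (1.0 - threshold)
    return prevalence - (1.0 - prevalence) * odds


# Toy data (hypothetical): 1 = diseased, 0 = healthy.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.7, 0.4, 0.1, 0.3, 0.6, 0.05]

# A decision curve plots these values against the threshold probability.
for pt in (0.1, 0.25, 0.5):
    model_nb = net_benefit(y_true, y_prob, pt)
    all_nb = net_benefit_treat_all(y_true, pt)
    print(f"p_t={pt:.2f}  model={model_nb:.3f}  treat-all={all_nb:.3f}")
```

The "treat none" strategy has net benefit zero at every threshold, so a model is useful at a given threshold only if its net benefit exceeds both zero and the treat-all value.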
Supervisors : Prof. Somdatta Sinha (IISER, Mohali) and Dr. Samsiddhi Bhattacharjee (NIBMG, Kalyani)
This project was done in three parts. In the first part, a simulation-based comparative study of variable selection was done in a linear-regression setting, comparing a penalized-regression method, the Least Absolute Shrinkage and Selection Operator (LASSO), with univariate regression followed by False Discovery Rate (FDR) control. Sensitivity, specificity and Receiver Operating Characteristic (ROC) curves were used to compare these methods. In the second part, Principal Component Analysis (PCA), a dimension reduction technique, was used to compare the codon usage bias of HIV-1 viral genomes and genes with that of the human host using whole genome sequences. In the third part, Single Nucleotide Polymorphism (SNP) selection was done using an Empirical Bayes strategy in Genome-Wide Association Studies (GWAS). See Project
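The FDR step in the first part can be illustrated with a short Python sketch. It assumes the Benjamini-Hochberg step-up procedure was the FDR method applied to the per-variable p-values from univariate regression, which the description above does not state explicitly. The function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are selected.

    Benjamini-Hochberg step-up rule: sort the m p-values, find the largest
    rank k with p_(k) <= (k / m) * alpha, and select the k smallest.
    """
    m = len(p_values)
    # Indices of the p-values in ascending order of p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank  # keep the largest rank that passes
    selected = [False] * m
    for idx in order[:max_k]:
        selected[idx] = True
    return selected


# Hypothetical p-values, one per candidate variable.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.6]
print(benjamini_hochberg(p_vals, alpha=0.05))
```

Comparing the selected variables with the truly non-zero coefficients in a simulation gives the sensitivity and specificity used to compare this approach with LASSO.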
I also work as a Teaching Assistant at the National University of Singapore and have tutored the following modules so far.