Dr. Sumaiya Sande

PhD in Statistics and Applied Probability from the National University of Singapore. I have a strong mathematical, statistical, and technical background in data science. Avid reader and data enthusiast, keen to learn about new advances in technology and keep my knowledge up to date.

Welcome to my webpage!

Education

National University of Singapore (NUS)

Doctor of Philosophy (PhD)
Department of Statistics and Applied Probability (DSAP)

Supervisor : Dr. Li Jialiang (DSAP, NUS)

August 2017 - March 2021

Indian Institute of Science Education and Research (IISER), Mohali

Integrated BS-MS in Mathematics

MS Thesis Supervisors : Prof. Somdatta Sinha (IISER, Mohali) and Dr. Samsiddhi Bhattacharjee (NIBMG, Kalyani)

August 2012 - May 2017

Experience

Associate Consultant

Northern Trust

At Northern Trust, I built a credit risk model. The work involved large datasets with several loan-level and macroeconomic characteristics. I performed extensive data cleaning to ensure data quality, then applied an ML model to classify defaulting versus non-defaulting clients.

Sept 2022 - Aug 2023

Data Scientist

Qzense Labs

At Qzense Labs, I built model pipelines for various predictive jobs. These covered fetching data from the database, preprocessing it, training the model, validating and testing it, and deploying it, all implemented as an AutoML pipeline on AWS. I also worked on time series modelling to track fruit ripeness.

May 2021 - Aug 2022

PhD thesis

National University of Singapore

Worked on several projects, such as deriving statistical inference for decision curve analysis (an accuracy measure for classification) with application to cataract diagnosis. Also worked on modern supervised machine learning and deep learning methods with end-to-end analysis.

August 2017 - March 2021

MS thesis

Indian Institute of Science Education and Research (IISER), Mohali

Worked extensively on an HIV dataset, applying machine learning methods for dimensionality reduction such as principal component analysis, multi-dimensional scaling, and others.

August 2016 - April 2017

Summer Intern

National Institute of Biomedical Genomics, Kalyani, West Bengal

Worked in the area of statistical genomics. Completed two projects applying machine learning methods to find associations in GWAS (Genome-Wide Association Studies) data. Used Empirical Bayes analysis for the project.

May-July 2016

Publications

Articles in Scientific Journals
  • Sande SZ, Li J, D'Agostino R, Yin Wong T, Cheng C-Y. Statistical inference for decision curve analysis, with applications to cataract diagnosis. Statistics in Medicine. 2020;1–23.
  • Sande SZ, Seng L, Li J, D'Agostino R. Statistical learning in medical research with decision threshold and accuracy evaluation. Journal of Data Science. 2021;19(4):634–657. DOI 10.6339/21-JDS1022

Articles in Medium
  • Principal Component Analysis (PCA): Theory (published with Analytics Vidhya)
  • Get Started with Time Series Forecasting in Python (published with Analytics Vidhya)

Projects

Risk Management : Residential Real Estate Model

Northern Trust

In this project, I worked on financial credit risk data. The dataset was large, with millions of records covering loan-level variables and several macroeconomic variables. The task was to identify high-risk (defaulting) clients as accurately as possible. The biggest challenges were missing-value imputation and the low count of defaults relative to non-defaults. The data was checked for quality before the models were applied. The key analyses were single-factor analysis, multiple-factor analysis, and the calculation of performance measures such as Gini and RMSE. SAS was used as the programming tool.

Sept 2022 - Aug 2023

Fish Freshness Index Based On Gills Classification

Qzense Labs

In this project, we classified fish by freshness index, using images of the gills as the dataset. We built transfer learning models such as ResNet, as well as our own in-house-trained CNN model, to classify fish into the six classes of the freshness index. I also applied an ML model to the same task. The challenging part was cleaning and preprocessing the data, as the images were taken with a phone camera. The solutions we tried included cropping the image to extract the gills, calculating the mucus percentage by slicing the image into several parts and then stacking the mucus-classifying models to obtain the freshness index, and using dimensionality reduction on the images for the ML model.

April 2022 - July 2022

Creating Model Pipelines for Assessing Commodity Features Such as Brix and Ripeness

Qzense Labs

At Qzense Labs, I built model pipelines for various predictive jobs. The pipelines covered fetching data from the database, preprocessing it, training the model, validating and testing it, and deploying it, all as an AutoML pipeline created in AWS. I also worked on time series modelling to track fruit ripeness.

May 2021 - January 2022

Practical Guide to Modern Machine Learning Methods for Classification with ROC and DCA

PhD thesis at National University of Singapore

Supervisor : Dr. Li Jialiang (DSAP, NUS)

Machine learning models are being used for medical data analysis to reduce human effort and understand the patterns of disease propagation. When the data is unstructured, shallow machine learning methods may not be a feasible option. Hence, state-of-the-art deep learning neural networks such as the multilayer perceptron (MLP) and the convolutional neural network (CNN) should be incorporated in medical diagnosis and prognosis for better results. For a binary outcome variable, accuracy measures such as sensitivity, specificity, and the area under the receiver operating characteristic curve are used to assess model efficacy, but they fail to account for the utility of the model itself. Decision curve analysis is used in medical studies to address this problem. In this project, widely used supervised learning methods (shallow and deep) are reviewed and demonstrated using real clinical data. We also provide R code to illustrate how to apply the machine learning and deep learning methods. This project will help medical decision makers understand different classification methods and how to use them in real-world scenarios.

August 2017 - March 2021

Statistical inference for decision curve analysis

PhD thesis at National University of Singapore

Supervisor : Dr. Li Jialiang (DSAP, NUS)

Statistical learning methods are widely used in the medical literature for diagnosis or prediction. Conventional accuracy assessment via sensitivity, specificity, and ROC curves does not fully account for the clinical utility of a specific model. Decision curve analysis (DCA) is a novel complement, as it incorporates a clinical judgment of the relative value of benefits (treating a true positive case) and harms (treating a false positive case) associated with prediction models. The preference of a patient or a policy-maker is formulated statistically as the underlying threshold probability, above which the patient would choose to be treated. Net benefit, which places benefits and harms on the same scale, is then calculated for each possible threshold probability. We consider the inference problems for DCA in this paper. An interval estimation procedure and inference methodology are provided after we derive the relevant asymptotic properties. Our formulation can accommodate classification problems with multiple categories. We carry out numerical studies to assess the performance of the proposed methods. An eye disease dataset is analyzed to illustrate our proposals.

August 2017 - March 2021
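
For readers unfamiliar with decision curve analysis, the net benefit mentioned in the project description above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method derived in the paper: it uses the standard binary-outcome formula (true positive rate minus false positive rate weighted by the odds of the threshold probability), treats a patient whenever the predicted probability is at or above the threshold, and the function names and toy data are hypothetical.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of a risk model at one threshold probability.

    Assumed textbook form for a binary outcome:
        NB(p_t) = TP/n - (FP/n) * p_t / (1 - p_t)
    where a case is "treated" if its predicted risk is >= p_t.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    # Count treated cases that are truly diseased (benefit) or not (harm).
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: treat everyone regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical toy data: 1 = diseased, 0 = healthy.
y_true = [1, 1, 0, 0]
y_prob = [0.9, 0.4, 0.6, 0.1]

for p_t in (0.2, 0.5):
    print(p_t, net_benefit(y_true, y_prob, p_t), net_benefit_treat_all(y_true, p_t))
```

Plotting these values over a grid of thresholds gives the decision curve. The "treat all" line, together with the "treat none" line at zero, is the usual baseline against which a model is compared.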

Statistical Methods for Dimension Reduction and Feature Selection for Integrating Genomic and Other Biological Data 

MS thesis at IISER, Mohali

Supervisors : Prof. Somdatta Sinha (IISER, Mohali) and Dr. Samsiddhi Bhattacharjee (NIBMG, Kalyani)

This project was done in three parts. In the first part, a simulation-based comparative study of variable selection was carried out in a linear regression setting, comparing a penalized regression method, the Least Absolute Shrinkage and Selection Operator (LASSO), with univariate regression followed by False Discovery Rate (FDR) control. Sensitivity, specificity, and Receiver Operating Characteristic (ROC) curves were used to compare these methods. In the second part, Principal Component Analysis (PCA), a dimension reduction technique, was used to compare the codon usage bias of HIV-1 viral genomes and genes with that of the human host using whole-genome sequences. In the third part, Single Nucleotide Polymorphism (SNP) selection was done using an Empirical Bayes strategy in Genome-Wide Association Studies (GWAS).

August 2016 - May 2017
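
As a small illustration of the "univariate regression followed by FDR" step in the first part of the MS thesis project above, the sketch below applies the Benjamini-Hochberg step-up procedure to a list of p-values. The project description does not say which FDR procedure was used, so Benjamini-Hochberg is an assumption here, and the function name and p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Benjamini-Hochberg step-up rule (assumed, not confirmed by the source):
    sort the p-values, find the largest rank k with p_(k) <= k * alpha / m,
    and reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, e.g. one per candidate variable from univariate fits.
p_values = [0.04, 0.001, 0.03, 0.5]
print(benjamini_hochberg(p_values, alpha=0.05))
```

In a variable selection setting, the variables whose hypotheses are rejected would be the ones kept, which is what makes the result comparable with the set of non-zero LASSO coefficients.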

Skills

Programming Languages & Tools
  • Python
  • R
  • SAS
  • MS Excel
  • AWS S3, AWS SageMaker
  • SQL, Database Management
  • HTML
  • GitHub
  • LaTeX

Key Highlights
  • Mathematics
  • Statistics
  • Machine Learning
  • Deep Learning

Teaching Experience

I am also working as a Teaching Assistant at the National University of Singapore and have tutored the following modules so far.

  • ST2334 : Probability and Statistics
  • ST3131 : Regression Analysis
  • ST1232 : Statistics for Life Sciences
  • ST2132 : Mathematical Statistics
  • ST3101 : Data Science in Practice
  • ST2137 : Computer Aided Data Analysis
  • IND5003 : Data Analytics for Sense-Making in Python

Scholarships

  • National University of Singapore Research Scholarship, (2017 - Present)
  • INSPIRE Fellowship, IISER Mohali (2012 - 2017)
  • National Network for Mathematical and Computational Biology (NNMCB) Summer Research Fellowship, NIBMG Kalyani (2016)
  • Indian Academy of Sciences (IAS) Summer Research Fellowship, NIBMG Kalyani (2015)

Interests

  • Reading scientific books
  • Listening to music and painting
  • Travelling to new places and meeting new people
  • Watching movies and web series