Select Projects 

Project | 01
RSNA Intracranial Hemorrhage Detection with CNNs
Concepts and Key Words:
deep learning, convolutional neural networks, image recognition, big data, healthcare, diagnostics, cloud computing, multi-label classification
  • Built and tuned a convolutional neural network on over 100,000 CT scans (DICOM files) to predict whether a given scan showed a hemorrhage and, if it did, which of four subtypes it belonged to. 
  • Notable Libraries Used: TensorFlow, Keras, Pydicom
  • Ran the neural network on a GPU-equipped virtual machine from Paperspace; the files totaled over 1 TB.
  • Biggest Challenges: Image manipulation and windowing, large-dataset management, and hyperparameter tuning for a multi-label classification problem (the windowing step and the multi-label model head are sketched below).
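A minimal sketch of the DICOM windowing step (the brain-window center/width values and the exact function shape are illustrative assumptions, not my exact pipeline):

```python
# Windowing sketch: convert raw pixels to Hounsfield units, then clip to a
# brain window. center=40, width=80 are standard brain-window values,
# assumed here for illustration.
import numpy as np
import pydicom

def load_windowed_slice(path, center=40, width=80):
    ds = pydicom.dcmread(path)
    hu = ds.pixel_array * float(ds.RescaleSlope) + float(ds.RescaleIntercept)
    lo, hi = center - width / 2, center + width / 2
    windowed = np.clip(hu, lo, hi)
    return (windowed - lo) / (hi - lo)  # scaled to [0, 1] for the network
```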
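And a minimal Keras sketch of the multi-label head (input size and layer widths are assumptions): each label gets an independent sigmoid, so a scan can be positive for hemorrhage plus any combination of subtypes, trained with binary rather than categorical cross-entropy.

```python
# Multi-label CNN sketch (architecture details assumed, not my exact model).
# Five sigmoid outputs: 'any hemorrhage' plus the four subtypes.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(512, 512, 1)),   # one windowed CT slice
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```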
Project | 02
Oncogenic Classification of Genetic Variations
Concepts and Key Words:
feature engineering using domain knowledge, natural language processing, unbalanced data, multiclass classification, healthcare, cancer bioinformatics
  • The original training data included each observation's oncogenic class (one of nine), its specific combination of gene + variation ('truncating mutation', 'promoter mutation', 'Q137R', 'W430A', etc.), and a text file (~6,000 words per observation) containing clinical information about that genetic variation that could help in classification.
  • I binarized the information in the variation column in a few ways. One was the type of mutation: missense, nonsense, promoter region-associated, deletion, insertion, insertion-deletion, etc. (a feature sketch appears after this list).
  • Furthermore, I broke down missense mutations into those that encoded an amino acid of a different chemical group (polar, nonpolar, acidic, basic). Sickle cell disease shows how a chemical-group change in the encoded amino acid can have a large effect: Glutamic Acid (acidic) is changed to Valine (nonpolar). I hypothesized that such a strong change in protein function would affect oncogenic classification.
  • I processed the text in various ways and saw only modest increases in precision/recall scores; improving the text features is the focus of future work on the model. Taking the amino acid position of the genetic variation into consideration produced no significant increase in evaluation metrics.
  • Created and validated Random Forest, Bernoulli and Multinomial Naïve Bayes, Support Vector Machines, and other models. Random Forest ended up being my best model, correctly assigning just over 60% of the unseen test observations to their correct class of the nine (a validation sketch appears after this list).
  • Notable Libraries Used: Natural Language Toolkit (NLTK), Scikit-learn (including many of the models and grid search)
  • Challenges: imbalanced data (ameliorated by splitting with stratified k-folds) and a rather small dataset at 3,321 rows, though a wide one given all the text information.
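A minimal sketch of the variation-column features (the parsing rules and the amino-acid grouping below are illustrative assumptions, not my exact encoding):

```python
# Feature sketch: binarize a variation string into mutation-type flags and,
# for point mutations like 'Q137R' or 'E6V', flag a chemical-group change.
import re

# One common amino-acid grouping, assumed here for illustration.
GROUPS = {
    "nonpolar": set("GAVLIPFMW"),
    "polar":    set("STCYNQ"),
    "acidic":   set("DE"),
    "basic":    set("KRH"),
}

def group_of(aa):
    return next((g for g, members in GROUPS.items() if aa in members), None)

POINT = re.compile(r"^([A-Z])(\d+)([A-Z*])$")  # e.g. E6V, Q137R, W430A

def variation_features(variation):
    t = variation.lower()
    feats = {
        "truncating": int("truncating" in t),
        "promoter":   int("promoter" in t),
        "deletion":   int("del" in t),
        "insertion":  int("ins" in t),
    }
    m = POINT.match(variation.strip())
    missense = bool(m) and m.group(3) != "*"      # '*' would be nonsense
    feats["missense"] = int(missense)
    feats["group_change"] = int(
        missense and group_of(m.group(1)) != group_of(m.group(3))
    )
    return feats

# E6V is the sickle-cell change: Glutamic Acid (acidic) -> Valine (nonpolar).
print(variation_features("E6V"))  # group_change == 1
```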
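And a minimal scikit-learn sketch of the validation setup (the TF-IDF text features and model settings are assumptions; my actual text processing used NLTK): stratified folds keep the unbalanced class proportions the same in every split.

```python
# Validation sketch (assumed settings, not my exact pipeline).
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
    TfidfVectorizer(max_features=20_000, stop_words="english"),
    RandomForestClassifier(n_estimators=300, random_state=0),
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# texts: the clinical text per observation; labels: the 9 oncogenic classes
# scores = cross_val_score(pipe, texts, labels, cv=cv)
```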
Project | 03
Modeling FDA Drug Adverse Reactions 
Concepts and Key Words:
API calls and authentication, government data, time-series modeling, forecasting, binary classification, large amounts of missing data, data wrangling
  • Wrote a Python script to retrieve 435,000 observations with 21 columns (many of them containing lists) from the openFDA API. The data focused on reported adverse drug reactions.
  • Novel project: the idea and goal were self-defined rather than originating in a competition. The goal was to forecast the number of 'serious' and non-'serious' adverse reaction reports over the next five years. This would be useful for several reasons, one being the FDA's allotment of resources and evaluation of programs to reduce underreporting, especially of non-'serious' adverse reactions. 
  • Notable Libraries Used: Scikit-learn, Requests
  • Biggest Challenges: API calls were limited to 100 observations per call, and each JSON response had to be scanned several layers deep; each observation had a different number of keys and a different number of items within those keys. For example, an individual on multiple medications would have multiple items in the list under the 'drug' key (a paging-and-flattening sketch appears below).
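A minimal sketch of the paging and flattening logic (the endpoint and field names follow the openFDA drug-event API, but the exact fields I kept are assumptions):

```python
# Pagination sketch: openFDA caps each call at 100 records, so page through
# the endpoint with limit/skip, then flatten nested, variable-length fields.
import requests

BASE = "https://api.fda.gov/drug/event.json"

def fetch(pages=3, per_page=100):
    records = []
    for page in range(pages):
        resp = requests.get(BASE, params={"limit": per_page,
                                          "skip": page * per_page})
        resp.raise_for_status()
        records.extend(resp.json().get("results", []))
    return records

def flatten(report):
    # An individual on multiple medications has multiple items in the list
    # under patient -> drug; keep a count plus the drug names.
    drugs = report.get("patient", {}).get("drug", [])
    return {
        "serious": report.get("serious"),
        "n_drugs": len(drugs),
        "drugs": [d.get("medicinalproduct") for d in drugs],
    }

rows = [flatten(r) for r in fetch(pages=1)]
```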

©SEBASTIAN MONZON 2019