This work was initially motivated by a model built for NCAA basketball bracket prediction. Current and previous work in this area shows slight improvements in model prediction each year, but some suspect a “ceiling” on model improvement for this problem (with most models now reaching only 75% accuracy).
Instead of relying on traditional data normalization procedures for pre-processing, I decided to leverage my background in public health and apply established genetic sequence normalization techniques to the basketball data. My best models used this genetic pre-processing method, but it was statistically unclear why they performed best.
In the research on data normalization for genetic sequencing, normalization is often study-specific, which makes studies difficult to replicate. There appears to be no consistent pre-processing pipeline in this area, and the downstream analysis effects of the various normalization choices are not well documented.
My study aims to quantify the downstream analysis effects of pre-processing normalization and feature engineering choices. It uses a decomposition of the loss functions with respect to both traditional normalization strategies and the more complex genetic-sequence-based methods, measuring effects on model bias and variance as well as data structure variance and irreducible error.
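As a reference point for the decomposition described above, the textbook bias-variance decomposition can be written as follows. This is only a sketch under assumptions that are not stated in this description: squared-error loss, an additive noise model, and noise independent of the training sample. The symbols $f$, $\hat{f}_D$, $D$, $\varepsilon$, and $\sigma^2$ are illustrative names, and the "data structure variance" component mentioned above is not part of this standard form.

```latex
% Assumed setup (illustrative, not taken from the study):
%   y = f(x) + \varepsilon, with E[\varepsilon] = 0, Var(\varepsilon) = \sigma^2,
%   \varepsilon independent of the training sample D,
%   \hat{f}_D = model fit on D after a fixed normalization choice.
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat{f}_D(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Under this framing, comparing normalization choices would amount to comparing how each term changes when the pre-processing step is swapped and the rest of the pipeline is held fixed.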
This work and its methodology are still in progress, so this represents only a starting point for my literature review. However, as I will be using this towards my proposal, it would be greatly helpful to have at least a baseline review of my current sources. This is my first time trying this service, so hopefully what I’ve provided is appropriate. The number of pages is an estimate.