- Fine-tuning ProtBert: the model was trained just like BERT, so I used the [SEP] token to separate the MHC and peptide sequences and fed the output of the [CLS] token at the beginning into the classifier head. Unfortunately, due to lack of resources I only managed to train for 1.5 epochs, because each epoch took 19 hours. Even so, I achieved an average precision of 90%, a good enough ROC curve, and an F1 score of about 80%, which could improve drastically with 4-5 more epochs.
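As a minimal sketch of the pairing step: ProtBert's tokenizer expects uppercase, space-separated residues, with rare amino acids (U, Z, O, B) mapped to X; the `format_pair` helper below is my own illustrative name, not code from the notebook, and in practice you could equally pass the two sequences to the tokenizer as a text pair and let it insert [SEP] itself.

```python
import re

def format_pair(mhc_seq: str, peptide: str) -> str:
    """Join an MHC sequence and a peptide into one ProtBert-style input:
    uppercase, space-separated residues, rare amino acids mapped to X,
    with [SEP] between the two segments ([CLS] is added by the tokenizer)."""
    def prep(seq: str) -> str:
        return " ".join(re.sub(r"[UZOB]", "X", seq.upper()))
    return prep(mhc_seq) + " [SEP] " + prep(peptide)

# Hypothetical MHC pseudo-sequence paired with a short peptide
print(format_pair("YFAMYQENMAHTDANTLYII", "SIINFEKL"))
# → Y F A M Y Q E N M A H T D A N T L Y I I [SEP] S I I N F E K L
```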
- The other solution, which I didn't have time to try because of how long the first one took, was to embed the sequences with Facebook's ESM model and then feed the embeddings to a neural network. Because of their huge dimensionality, I was going to use PCA to reduce the dimension while keeping the important features of the data.
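The reduction step could be sketched as follows, assuming per-sequence ESM embeddings stacked into a matrix; the 1280-d size matches ESM-1b, and `pca_reduce` is a plain SVD-based PCA I wrote for illustration, not code from the project.

```python
import numpy as np

def pca_reduce(X: np.ndarray, k: int) -> np.ndarray:
    """Project rows of X onto the top-k principal components.
    X: (n_samples, dim) embedding matrix, e.g. one ESM embedding per row."""
    Xc = X - X.mean(axis=0)                      # center each feature
    # SVD of the centered data; rows of Vt are the principal directions,
    # already ordered by explained variance
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                         # shape (n_samples, k)

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 1280))               # stand-in for ESM embeddings
reduced = pca_reduce(emb, 64)
print(reduced.shape)                             # → (100, 64)
```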
In EDA I researched MHCs and extracted some features from the given MHC type, like the allele group. I then cleaned the data and used it in the BERT notebook to tokenize, train, and finally test the model. The classifier head is a dense layer with ReLU activation and some dropout to prevent overfitting, followed by a sigmoid so the answer comes out as a probability. Because of the large model checkpoint (1.8 GB) I didn't include it in the uploaded files. Lastly, in the evaluate-model notebook I evaluated the model using the test answers produced by the BERT notebook, which are included with the solution.
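The head described above (dense, ReLU, dropout, sigmoid) can be sketched in plain Python; the weights, sizes, and function name here are illustrative stand-ins, not the notebook's actual layers.

```python
import math
import random

def classifier_head(cls_vec, W, b, w_out, b_out, p_drop=0.2, train=False):
    """Dense -> ReLU -> dropout -> sigmoid over the pooled [CLS] vector.
    W, b: dense-layer weights/bias; w_out, b_out: output weights/bias."""
    # Dense layer with ReLU activation
    hidden = [max(0.0, sum(w * x for w, x in zip(row, cls_vec)) + bj)
              for row, bj in zip(W, b)]
    if train:  # inverted dropout, applied only during training
        hidden = [h / (1 - p_drop) if random.random() > p_drop else 0.0
                  for h in hidden]
    logit = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid -> probability

# Tiny example: a 4-d [CLS] vector with 3 hidden units (made-up weights)
x = [0.5, -1.0, 0.25, 2.0]
W = [[0.1, 0.2, -0.3, 0.4], [0.0, -0.1, 0.2, 0.3], [0.5, 0.5, 0.5, 0.5]]
p = classifier_head(x, W, b=[0.0, 0.1, -0.2], w_out=[1.0, -1.0, 0.5], b_out=0.0)
print(0.0 < p < 1.0)  # → True
```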