- What is my background
- How I got into ML
- Overview of ways I've used ML
- What I wish I had known earlier

- High school "computing" and maths
- Bioscience undergraduate: insignificant courses and research project
- Currently finishing a Bioinformatics PhD
- So no significant formal training in ML, maths or computer science

- Christopher Bishop's "Pattern Recognition and Machine Learning"
- Kevin Murphy's "Machine Learning: A Probabilistic Perspective"
- Gilbert Strang's "Linear Algebra and Its Applications"

- Provided Dataset
- Evaluation Metric
- Public and private leaderboards
- Top-N winners' methods and code are released publicly (on Kaggle's No Free Hunch blog)

- 400M 150 bp paired-end reads (159 GB)

- K-means clustering
- Fit GMM using Expectation-Maximisation
- Python (scikit-learn) still hadn't finished *parsing* the input after 48 hours
- C++ (MLPACK/Armadillo): 12 hours (6 GB of memory), single-threaded
- Lesson: use the right tool for the job (Python: Scikit-Learn, R: Caret, Java: Weka, C++: MLPACK etc)
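The two-step pipeline above (k-means to initialise, then EM for a Gaussian mixture) can be sketched in scikit-learn. This is an illustration on synthetic data, not the actual run from the talk, which used MLPACK in C++ for the 159 GB dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy stand-in for the real read-derived feature matrix
X = rng.normal(size=(1000, 4))

# Step 1: k-means gives fast, hard cluster assignments
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Step 2: use the k-means centroids to initialise a GMM fitted by EM
gmm = GaussianMixture(n_components=3, means_init=km.cluster_centers_,
                      random_state=0).fit(X)
labels = gmm.predict(X)
```

The same algorithms exist in MLPACK (`mlpack_kmeans`, `mlpack_gmm_train`); the lesson below is about picking the implementation that matches the scale of the data.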

- Data preprocessing: downsampling, cleaning
- Channel correlations: independent component analysis, common spatial patterns, MVARs
- Approximately 850 different features
- Recursive Feature Elimination
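Recursive feature elimination as used on the ~850 features can be sketched with scikit-learn's `RFE` (synthetic data and a logistic-regression base estimator here, purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the ~850 EEG-derived features
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

# Recursively drop the weakest features (5 at a time) until 10 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10, step=5)
rfe.fit(X, y)
kept = rfe.support_.sum()  # exactly 10 features survive
```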

- Random forests
- Support Vector Machines
- Logistic Regression
- Adaboost
- Ensembles!
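A minimal sketch of combining the four model families above into one ensemble, here via scikit-learn's soft-voting classifier on synthetic data (the competition pipeline itself was more involved):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Average the predicted probabilities of all four model families
ens = VotingClassifier([
    ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
    ('svm', SVC(probability=True, random_state=0)),
    ('lr', LogisticRegression(max_iter=1000)),
    ('ada', AdaBoostClassifier(random_state=0)),
], voting='soft')
ens.fit(X, y)
acc = ens.score(X, y)
```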

- Top 5% (16/504)
- Reasonably accurate prediction: 0.77014 AUC
- Team: Gavin Gray and Scott Lowe
- Lesson: planning and ensembles

- Metatranscriptome origin classification
- Goal: classify transcripts into origin species

- Features: Phylogenetic tree and sequence features (GC, trinucleotide)
- SVM/RVM
- F1 score nearly as good as manual classification (ad hoc self-consistency...)
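The sequence features mentioned above (GC content plus trinucleotide composition) are straightforward to compute; a minimal sketch, with a hypothetical helper name:

```python
from collections import Counter
from itertools import product

def sequence_features(seq):
    """GC content plus a 64-element trinucleotide frequency vector."""
    seq = seq.upper()
    gc = (seq.count('G') + seq.count('C')) / len(seq)
    # Count overlapping trinucleotides along the transcript
    tri_counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    total = max(sum(tri_counts.values()), 1)
    # Fixed-order vector over all 4^3 = 64 possible trinucleotides
    tris = [''.join(t) for t in product('ACGT', repeat=3)]
    return [gc] + [tri_counts[t] / total for t in tris]

feats = sequence_features("ATGCGCGTATGC")
# feats[0] is GC content; feats[1:] are the 64 trinucleotide frequencies
```

Vectors like this, alongside the phylogenetic features, are what the SVM/RVM classifiers consume.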

- Diabetic retinopathy detection
- Two eyes, 5 ratings for each eye
- Same ML approach as NeuKrill-Net
- Difficult cost function (negative quadratic weighted kappa)
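Quadratic weighted kappa is available in scikit-learn as `cohen_kappa_score` with `weights='quadratic'`; a quick sketch on made-up ratings:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Made-up severity ratings on the 0-4 scale used in the competition
y_true = np.array([0, 1, 2, 3, 4, 2, 1, 0])
y_pred = np.array([0, 1, 2, 2, 4, 3, 1, 1])

# Quadratic weighting: the penalty grows with the square of the rating gap,
# so predicting 0 for a true 4 costs far more than predicting 3
qwk = cohen_kappa_score(y_true, y_pred, weights='quadratic')
loss = -qwk  # competitions often pose this as a loss to minimise
```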

- *De novo* assembly parameter optimisation (k-mer size, minimum coverage, normalisation)
- Bayesian optimisation of assembly likelihood
- Spearmint experimentation evaluated using RSEM-eval
- Key challenges i.e. representative subset, generalising for assemblers
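The Bayesian-optimisation loop can be sketched as: fit a surrogate model to the (parameter, score) pairs evaluated so far, then pick the next parameter setting via an acquisition function. The sketch below uses a scikit-learn Gaussian process with a UCB acquisition as a stand-in for Spearmint, and a hypothetical `assembly_score` function in place of a real RSEM-eval run:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Hypothetical smooth stand-in for an RSEM-eval assembly score,
# peaking at a k-mer size of 31 (illustration only)
def assembly_score(k):
    return -((k - 31) ** 2) / 100.0

candidates = np.arange(21, 64, 2)   # odd k-mer sizes to consider
tried_k = [21, 63]                  # two initial (expensive) evaluations
scores = [assembly_score(k) for k in tried_k]

for _ in range(6):
    # Fit a Gaussian-process surrogate to the (k, score) pairs seen so far
    gp = GaussianProcessRegressor(random_state=0)
    gp.fit(np.array(tried_k, dtype=float)[:, None], scores)
    remaining = np.array([k for k in candidates if k not in tried_k],
                         dtype=float)
    mu, sigma = gp.predict(remaining[:, None], return_std=True)
    # Upper-confidence-bound acquisition: evaluate where mean + uncertainty
    # is highest, trading off exploitation against exploration
    k_next = int(remaining[np.argmax(mu + 1.96 * sigma)])
    tried_k.append(k_next)
    scores.append(assembly_score(k_next))

best_k = tried_k[int(np.argmax(scores))]
```

The point is the shape of the loop, not the specific surrogate or acquisition: each real evaluation (a full assembly plus RSEM-eval) is expensive, so the surrogate decides where to spend the next one.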

- Short-form spoken word recommender system
- Goal: recommender system for audio clips
- Hare-brained startup
- Audio transcription and metadata (Project Gutenberg)
- Very much a work in progress
- RNNs and LSTMs are really cool

- Exploratory data analysis
- Plot density, scatter, structure of data
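A minimal sketch of that first look at a dataset, using pandas and matplotlib on synthetic data (summary statistics, a density plot, and a scatter plot):

```python
import matplotlib
matplotlib.use('Agg')               # headless backend so this runs in scripts
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.normal(size=500), 'y': rng.normal(size=500)})
df['y'] += 0.5 * df['x']            # hidden structure to discover

print(df.describe())                # summary statistics first

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
df['x'].plot.density(ax=ax1, title='density of x')
df.plot.scatter(x='x', y='y', ax=ax2, title='x vs y')
fig.savefig('eda.png')
```

Five minutes of this often reveals skew, outliers, or correlations that change the whole modelling approach.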

- "any two [...] algorithms are equivalent when their performance is averaged across all possible problems"
- No universally best optimiser, classifier, or metric
- Wolpert and Macready proofs
- Real world not so nicely random

- The ML algorithm (e.g. KNN, SVM, RF etc) is only part of the problem
- Important to be aware of the evaluation function (e.g. sum-of-squares distance, classification error)
- and the optimisation method (e.g. gradient descent, expectation-maximisation)
- Can make a big difference to performance (and run time)
- Ties into NFL theory

- Intuitions will fail in high dimensions
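One concrete failure: distance concentration. A small numeric experiment (synthetic uniform data) shows that the ratio between the nearest and farthest pairwise distances, tiny in 2-D, approaches 1 in high dimensions, which undermines distance-based intuition (and methods like KNN):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

ratios = {}
for d in (2, 1000):
    X = rng.random((200, d))   # 200 uniform random points in d dimensions
    dists = pdist(X)           # all pairwise Euclidean distances
    ratios[d] = dists.min() / dists.max()

# In 2-D the nearest pair is far closer than the farthest; in 1000-D
# almost all pairs are nearly the same distance apart
print(ratios)
```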

- Most of the work is the boring stuff
- Data exploration
- Data gathering
- Data cleaning
- More data can beat smarter algorithms (but not always)

- Worth taking time setting up tools and data
- *Unit tests*
- Literate programming e.g. Rmd, Jupyter
- Version control e.g. git

- Bagging (average estimators trained on random subsets of data)
- Merging (model averaging)
- Stacking (feed estimators into other estimators)
- Boosting (iteratively train estimators on data that previous models misclassify)
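Three of the four strategies above map directly onto scikit-learn estimators; a minimal sketch on synthetic data (merging, i.e. plain model averaging, is just a mean over predictions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Bagging: average trees trained on bootstrap subsets of the data
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                        random_state=0)

# Boosting: iteratively reweight examples earlier learners got wrong
boost = AdaBoostClassifier(n_estimators=25, random_state=0)

# Stacking: feed the base estimators' predictions into a final estimator
stack = StackingClassifier([('bag', bag), ('boost', boost)],
                           final_estimator=LogisticRegression())
stack.fit(X, y)
acc = stack.score(X, y)
```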

- Machine learning can be used to do cool things
- It is not as opaque as it appears
- Optimisation and evaluation are as important as the ML algorithm
- Highly iterative process
- Use version control and literate programming to save yourself a lot of difficulty