Linear Regression and Logistic Regression for Gene Expression Prediction
Lab1: Regression
Part 1: Linear Regression for Gene Expression Prediction (40 points)
There are ~20,000 genes in the human genome. Each of them is transcribed to mRNA and then translated into proteins, which carry out various tasks inside our body. We can measure the amount of each of these ~20,000 mRNAs expressed in samples collected from different organs. This collection of measurements is called a gene expression profile.
Although our genome is the same across all cell types, the gene expression profile differs because each organ needs different proteins for its survival. One of the regulatory mechanisms that control the expression level in each cell type is microRNA (miR). MicroRNAs are small molecules that attach to mRNAs, prevent them from being translated into proteins, and also degrade them.
So if miR A targets mRNA B, then B decreases when A increases. Our goal is to predict mRNA levels (the gene expression profile) using 21 miR features. Note that each of the 20,000 expression levels can serve as the response of a regression with 21 features. To simplify, we have selected a few genes whose expression you will predict.
Your job will be to investigate how well the miR values predict the mRNA values.
We recommend using the sklearn.linear_model package to conduct the linear regression experiments, but you may use other packages if you wish.
You will need the following data files:
Load the provided data and implement the code required for the following steps:
You should randomly divide the samples into 80/20 training/test splits, and repeat the experiment 10 times so that you can report the mean and standard deviation of each metric.
Predict the expression of every well-expressed and poorly expressed gene with a separate linear regression predictor for each mRNA. In total, you should solve 35 linear regression problems.
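The repeated 80/20 protocol described above can be sketched as follows. This is only an illustration under stated assumptions: the data here is synthetic (the lab's data files are not reproduced), the helper name repeated_split_scores is hypothetical, and R^2 and MSE are chosen as example metrics rather than the ones the lab requires.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def repeated_split_scores(X, Y, n_repeats=10, test_size=0.2, seed=0):
    """Fit one linear regression per column of Y over repeated random splits.

    Returns per-gene mean and standard deviation of test R^2 and test MSE.
    (Hypothetical helper, not part of the lab materials.)
    """
    r2_runs, mse_runs = [], []
    for rep in range(n_repeats):
        # A different random_state per repetition gives a different 80/20 split.
        X_tr, X_te, Y_tr, Y_te = train_test_split(
            X, Y, test_size=test_size, random_state=seed + rep
        )
        # With a 2-D target, LinearRegression fits each column independently,
        # which is equivalent to solving one regression problem per gene.
        model = LinearRegression().fit(X_tr, Y_tr)
        Y_hat = model.predict(X_te)
        r2_runs.append(r2_score(Y_te, Y_hat, multioutput="raw_values"))
        mse_runs.append(mean_squared_error(Y_te, Y_hat, multioutput="raw_values"))
    r2_runs, mse_runs = np.array(r2_runs), np.array(mse_runs)
    return {
        "r2_mean": r2_runs.mean(axis=0), "r2_std": r2_runs.std(axis=0),
        "mse_mean": mse_runs.mean(axis=0), "mse_std": mse_runs.std(axis=0),
    }


# Synthetic stand-in data (NOT the lab data): 200 samples, 21 miR features, 35 genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 21))
W = rng.normal(size=(21, 35))
Y = X @ W + rng.normal(scale=0.5, size=(200, 35))

scores = repeated_split_scores(X, Y)
print(scores["r2_mean"].shape)  # one mean R^2 per gene
```

With the real data, X would hold the 21 miR features and Y the 35 selected mRNA columns; comparing the per-gene means between the well-expressed and poorly expressed sets is then a matter of indexing the returned arrays.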
Report on the following:
Describe the differences you see between the well-expressed and poorly expressed gene sets.
In dummy variable coding of a categorical variable X with n levels, we add n - 1 columns to our features. The first level is coded as all zeros, and each remaining level sets exactly one of the columns to 1. For example, if we have a categorical feature “Direction” with four levels “South, West, North, East”, the following codes are required:
Direction   Column 1   Column 2   Column 3
South       0          0          0
West        1          0          0
North       0          1          0
East        0          0          1
So for the 33 levels of the “Tissue” feature, you need to add 32 columns to your feature (design) matrix. With the newly added features, run the linear regression again with the 80/20 split, report any change in the prediction performance of your model, and explain it.
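As a rough illustration of the dummy coding described above, the sketch below encodes the “Direction” example with pandas and appends the resulting columns to a numeric feature matrix. The sample values and the 21 random numeric features are made up for the example; only the level order (South first, so it becomes the all-zero reference level) follows the text.

```python
import numpy as np
import pandas as pd

# Fix the level order explicitly so that "South" is the first (reference) level.
levels = ["South", "West", "North", "East"]
direction = pd.Series(
    pd.Categorical(["North", "South", "East", "West", "South"], categories=levels)
)

# drop_first=True removes the column of the first level, leaving n - 1 = 3 columns.
dummies = pd.get_dummies(direction, drop_first=True, dtype=int)
print(dummies)

# Append the dummy columns to a (made-up) numeric feature matrix with 21 columns.
rng = np.random.default_rng(0)
X_num = rng.normal(size=(len(direction), 21))
X_design = np.hstack([X_num, dummies.to_numpy()])
print(X_design.shape)  # (5, 24)
```

The same pattern should carry over to the “Tissue” feature: 33 levels would yield 32 dummy columns, giving a design matrix with 21 + 32 columns.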
Part 2: Logistic Regression (40 points)
In this exercise, you will implement logistic regression by gradient descent. You should not use off-the-shelf logistic regression solvers for this problem. This will also exercise your data-handling skills, so you may want to read up on the pandas toolkit if you wish to use Python.
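One possible shape for such an implementation is sketched below using NumPy only. It is a minimal sketch, not a reference solution: the function names, the learning rate, the iteration count, and the synthetic data are all assumptions made for illustration. It minimizes the mean negative log-likelihood with batch gradient descent, whose gradient is X^T (p - y) / n.

```python
import numpy as np


def sigmoid(z):
    # Clip the argument to avoid overflow in exp for very large |z|.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))


def add_intercept(X):
    # Prepend a column of ones so the first weight acts as the intercept.
    return np.hstack([np.ones((X.shape[0], 1)), X])


def fit_logistic_gd(X, y, lr=0.1, n_iters=5000):
    """Batch gradient descent on the mean negative log-likelihood."""
    Xb = add_intercept(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        p = sigmoid(Xb @ w)               # predicted probabilities
        grad = Xb.T @ (p - y) / len(y)    # gradient of the mean log-loss
        w -= lr * grad
    return w


def predict_proba(X, w):
    return sigmoid(add_intercept(X) @ w)


# Synthetic data (assumption, not the lab data): 400 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
true_w = np.array([-0.5, 2.0, -1.0, 0.5])  # intercept followed by 3 weights
y = (rng.random(400) < predict_proba(X, true_w)).astype(float)

# Standardizing features usually helps gradient descent converge faster.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
w_hat = fit_logistic_gd(X_std, y)
acc = np.mean((predict_proba(X_std, w_hat) >= 0.5) == y)
print(w_hat, acc)
```

Standardization matters more on real measurements such as heights and weights, where features live on very different scales and a single learning rate would otherwise be hard to choose.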
Auxiliary notes: Logistic regression for binary prediction
Problem: You are given a dataset of 400 people. Half are female and half are male; likewise, half are basketball players and half are not. The data has three features: height (inches), weight (pounds), and female (0 = male, 1 = female). The variable you want to predict is basketball player (0 = non-player, 1 = player).