School of Information Technology, Deakin University, VIC 3125, Australia.

Assignment 4

In this assignment, you will apply many of the concepts covered in this course to build a good solution to a human activity recognition problem. This assignment has 6 parts.

Instructions

The dataset consists of training and testing data in the "train" and "test" folders. For training, use X_train.txt (data) and y_train.txt (labels). For testing, use X_test.txt (data) and y_test.txt (labels). The other files that come with the dataset may help you understand it better.
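As a starting point, the four files could be read with a small loader like the sketch below. It assumes (this is not stated in the handout) that each line of an X file holds one instance as whitespace-separated numbers, that each line of a y file holds one integer label, and that the folders sit under a common root directory. The function names load_matrix, load_labels and load_split are hypothetical.

```python
from pathlib import Path


def load_matrix(path):
    """Read a whitespace-separated text file into a list of float rows."""
    rows = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines
                rows.append([float(value) for value in fields])
    return rows


def load_labels(path):
    """Read one integer label per non-blank line."""
    with open(path) as handle:
        return [int(line) for line in handle if line.strip()]


def load_split(root, split):
    """Load features and labels for split 'train' or 'test'.

    Assumed layout: <root>/<split>/X_<split>.txt and y_<split>.txt.
    """
    folder = Path(root) / split
    features = load_matrix(folder / f"X_{split}.txt")
    labels = load_labels(folder / f"y_{split}.txt")
    if len(features) != len(labels):
        raise ValueError("number of instances and labels differ")
    return features, labels
```

For example, X_train, y_train = load_split("path/to/dataset", "train"), where the path is a placeholder for wherever the dataset was unpacked. If NumPy is available, numpy.loadtxt reads the same kind of file in one call.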

Please read the PDF file "dataset-paper.pdf" before answering Part 1.

Part 1: Understanding the data

Answer the following questions briefly after reading the paper. (3 Marks)

What is the objective of the data collection process?

What human activity types does this dataset have? How many subjects/people have performed these activities?

How many instances are available in the training and test sets? How many features are used to represent each instance? Summarize the type of features extracted in 2-3 sentences.

Briefly describe which machine learning model the paper uses for activity recognition and how it is trained. What is the maximum accuracy achieved?

Part 2: K-Nearest Neighbour Classification

Build a K-Nearest Neighbour classifier for this data. (5 Marks)

Let K take values from 1 to 50. For choosing the best K, use 10-fold cross-validation. Choose the best value of K based on model F1-score.

Show a plot of cross-validation accuracy with respect to K.

Using the best K value, evaluate the model performance on the supplied test set. Report the confusion matrix, multi-class averaged F1-score and accuracy.
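One possible way to organise the steps of this part is sketched below with scikit-learn. The choice of library, the use of stratified shuffled folds, and the reading of "multi-class averaged F1-score" as the macro average are all assumptions, not requirements of the assignment. The function names select_k and evaluate are hypothetical.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def select_k(X, y, k_values=range(1, 51), n_splits=10, seed=0):
    """Return the K with the highest mean cross-validated macro F1-score.

    Also returns a dict mapping each K to its mean score, which can be
    used for the plot against K.
    """
    # Assumption: stratified, shuffled folds with a fixed seed.
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    mean_f1 = {}
    for k in k_values:
        model = KNeighborsClassifier(n_neighbors=k)
        scores = cross_val_score(model, X, y, cv=folds, scoring="f1_macro")
        mean_f1[k] = float(scores.mean())
    # On ties, max() keeps the first (smallest) K encountered.
    best_k = max(mean_f1, key=mean_f1.get)
    return best_k, mean_f1


def evaluate(model, X_test, y_test):
    """Report the confusion matrix, macro F1-score and accuracy on a test set."""
    predicted = model.predict(X_test)
    return {
        "confusion_matrix": confusion_matrix(y_test, predicted),
        "macro_f1": f1_score(y_test, predicted, average="macro"),
        "accuracy": accuracy_score(y_test, predicted),
    }
```

After calling select_k on the training data, a final KNeighborsClassifier(n_neighbors=best_k) can be fitted on the full training set and passed to evaluate together with the test data. The same evaluate helper would also work for the classifiers in the later parts. Passing scoring="accuracy" to cross_val_score instead would give the accuracy values for the plot.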

Part 3: Multiclass Logistic Regression with Elastic Net

Build an elastic-net regularized logistic regression classifier for this data. (5 Marks)

The elastic-net regularizer takes two parameters: alpha and l1-ratio. Use the following values for alpha: 1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2. Use the following values for l1-ratio: 0, 0.15, 0.5, 0.7, 1.

Choose the best values of alpha and l1-ratio using 10-fold cross-validation, based on model F1-score.

Draw a surface plot of F1-score with respect to alpha and l1-ratio values.

Use the best values of alpha and l1-ratio to re-train the model on the training set and use it to predict the labels of the test set. Report the confusion matrix, multi-class averaged F1-score and accuracy.

Part 4: Support Vector Machine (RBF Kernel)

Build an SVM (with RBF kernel) classifier for this data. (6 Marks)

An SVM with an RBF kernel takes two parameters: gamma (the kernel coefficient, which is inversely related to the length scale of the RBF kernel) and C (the cost parameter). Use the following values for gamma: 1e-3, 1e-4. Use the following values for C: 1, 10, 100, 1000.

Choose the best values of gamma and C using 10-fold cross-validation, based on model F1-score.

Draw a surface plot of F1-score with respect to gamma and C.

Use the best values of gamma and C to re-train the model on the training set and use it to predict the labels of the test set. Report the confusion matrix, multi-class averaged F1-score and accuracy.
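A minimal sketch of the search and the surface plot for this part is given below, assuming scikit-learn's SVC and Matplotlib. The log10 scaling of both axes is a presentation choice made here, not something the assignment asks for. The function names search_rbf_svm and plot_f1_surface are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

GAMMAS = [1e-3, 1e-4]
COSTS = [1, 10, 100, 1000]


def search_rbf_svm(X, y, gammas=GAMMAS, costs=COSTS, n_splits=10):
    """Grid-search gamma and C by cross-validated macro F1-score."""
    grid = {"gamma": list(gammas), "C": list(costs)}
    # An integer cv gives stratified folds for classifiers.
    search = GridSearchCV(SVC(kernel="rbf"), grid,
                          scoring="f1_macro", cv=n_splits, refit=True)
    search.fit(X, y)
    return search


def plot_f1_surface(search, gammas=GAMMAS, costs=COSTS):
    """Draw mean cross-validated F1-score over the (C, gamma) grid."""
    gammas, costs = list(gammas), list(costs)
    scores = np.empty((len(gammas), len(costs)))
    results = search.cv_results_
    for params, score in zip(results["params"], results["mean_test_score"]):
        scores[gammas.index(params["gamma"]), costs.index(params["C"])] = score

    # meshgrid returns arrays of shape (len(gammas), len(costs)),
    # matching the layout of the score grid.
    cost_axis, gamma_axis = np.meshgrid(np.log10(costs), np.log10(gammas))
    figure = plt.figure()
    axes = figure.add_subplot(projection="3d")
    axes.plot_surface(cost_axis, gamma_axis, scores, cmap="viridis")
    axes.set_xlabel("log10(C)")
    axes.set_ylabel("log10(gamma)")
    axes.set_zlabel("mean CV F1-score (macro)")
    return figure, scores
```

Calling plot_f1_surface(search_rbf_svm(X_train, y_train)) followed by plt.show() would display the surface. The same plotting pattern can be reused for the other parts by swapping the parameter names.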

Part 5: Random Forest

Build a random forest classifier for this data. (6 Marks)

A random forest uses two parameters: the tree-depth for each decision tree and the number of trees. Use the following values for the tree-depth: 300, 500, 600. Use the following values for the number of trees: 200, 500, 700.

Choose the best values of tree-depth and number of trees using 10-fold cross-validation, based on model F1-score.

Draw a surface plot of F1-score with respect to tree-depth and number of trees.

Use the best values of tree-depth and number of trees to re-train the model on the training set and use it to predict the labels of the test set. Report the confusion matrix, multi-class averaged F1-score and accuracy.
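The search in this part could look like the sketch below, assuming scikit-learn's RandomForestClassifier. Mapping "tree-depth" to max_depth and "number of trees" to n_estimators is an assumption about what the assignment intends. The function name search_random_forest is hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

DEPTHS = [300, 500, 600]
TREE_COUNTS = [200, 500, 700]


def search_random_forest(X, y, depths=DEPTHS, tree_counts=TREE_COUNTS,
                         n_splits=10, seed=0):
    """Grid-search tree depth and forest size by cross-validated macro F1-score."""
    grid = {
        "max_depth": list(depths),          # assumed meaning of "tree-depth"
        "n_estimators": list(tree_counts),  # assumed meaning of "number of trees"
    }
    # A fixed random_state makes the comparison between settings repeatable.
    model = RandomForestClassifier(random_state=seed)
    search = GridSearchCV(model, grid, scoring="f1_macro",
                          cv=n_splits, refit=True)
    search.fit(X, y)
    return search
```

The fitted search exposes best_params_, and best_estimator_ is re-trained on the full training set. Trees may stop growing well before a depth limit this large is reached, in which case the scores across the depth values may come out nearly identical.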

Part 6: Discussion

Write a brief discussion of which classification method achieved the best performance and why you think it did. Which method performed the worst? Did you do better or worse than the results reported in the dataset paper? Do you have any suggestions for further improving model performance? (5 Marks)