
NTU EE5184 Machine Learning 2022 Study Notes


Notes on all the homework for this course; progress so far: 3/15.

HW1 Regression: COVID-19 Cases Prediction

Given survey results from a US state over the past 5 days, predict the percentage of new tested-positive cases on the fifth day.

The data comprise:

  • id (1 column)
  • one-hot encoded state (37 columns)
  • COVID-like illness indicators (4)
  • behavior indicators (8)
  • mental health indicators (3)
  • surveyed tested-positive percentage (1)

The training data therefore has 118 columns (1 + 37 + (4+8+3+1)×5), while the test data has 117 (it lacks the last day's tested-positive percentage).

covid.train.csv has shape (2700, 118), so there is not much data.

The assignment provides a code template. The only parts that need modification are the feature-selection function select_feat, the training setup (the optimizer, the scheduler, and the hyperparameters in config), and the model structure self.layers.

Training with the template as-is gives a score of 1.5637, close to the medium baseline.

Following the hints in the code, I added L2 regularization and switched the optimizer from SGD to Adam, which improved the result.
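
In PyTorch both changes are one line, since weight_decay on the optimizer is the built-in L2 penalty; this mirrors the line in the full script below:

optimizer = torch.optim.Adam(model.parameters(), weight_decay=5e-5)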

Setting the scheduler to torch.optim.lr_scheduler.CosineAnnealingWarmRestarts (cosine annealing with warm restarts) improved the result further.
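
The scheduler decays the learning rate along a cosine curve and restarts it after T_0 epochs, then T_0 * T_mult epochs, and so on; the training loop calls scheduler.step() once per epoch. These are the settings from the full script below:

scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=2, T_mult=3, eta_min=0)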

Making the network deeper by adding a few more layers improved the result again:

class My_Model(nn.Module):
    def __init__(self, input_dim):
        super(My_Model, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 16),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Linear(8, 4),
            nn.ReLU(),
            nn.Linear(4, 2),
            nn.ReLU(),
            nn.Linear(2, 1)
        )

    def forward(self, x):
        return self.layers(x).squeeze(1)  # (B, 1) -> (B), as in the template

Next, the data. The state information carries no useful signal, so I simply dropped those 37 columns; the score jumped to 1.009, immediately reaching the strong baseline.
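
A minimal sketch of what this looks like inside select_feat, assuming the column layout described above (id in column 0, the 37 state columns in 1-37, the daily features from column 38 on):

x_train = train_data[:, 38:-1]  # drop the id and the 37 one-hot state columns
y_train = train_data[:, -1]     # the last day's tested_positive is the target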

I then used sklearn.feature_selection.SelectKBest to pick the best 32/48 columns, but saw no clear improvement.
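
For reference, a sketch of that experiment (f_regression scores each column by its linear relationship with the target; the selector fitted on the training split must be reused on the other splits):

from sklearn.feature_selection import SelectKBest, f_regression

selector = SelectKBest(score_func=f_regression, k=32)  # also tried k=48
x_train_best = selector.fit_transform(x_train, y_train)
x_valid_best = selector.transform(x_valid)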

From here things got tedious. I tweaked the hyperparameters and the network many times, but the score barely moved; the best run reached 0.943, still a fair way from the boss baseline.

After several more hours of slow tuning, the score finally reached 0.85092, squeezing into the boss baseline. The changes that made the difference follow.

Back to the data. Based on my understanding of COVID, the positive rate has essentially nothing to do with the mental health indicators, so those columns were dropped outright. Also, there are only 2700 training samples, so augmenting the data is worth considering: the original task predicts from 5 days of data, but each sample can be split into 2-day/3-day windows, enlarging the training set by 4×/3×.
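
A sketch of the windowing idea, assuming the 118-column layout above (the split_data function in the full script does the same thing on a hand-picked feature subset): each 5-day sample yields three overlapping 3-day windows, tripling the training set.

import numpy as np

def make_windows(data, day_cols=16, offset=38, window=3):
    xs, ys = [], []
    for i in range(5 - window + 1):  # 3-day windows start at day 0, 1, 2
        block = data[:, offset + i * day_cols: offset + (i + window) * day_cols]
        xs.append(block[:, :-1])  # every feature in the window except...
        ys.append(block[:, -1])   # ...the last day's tested_positive (the target)
    return np.vstack(xs), np.hstack(ys)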

Finally, the network. Deeper networks turned out to perform poorly, so I switched to a single, very wide layer:

class My_Model(nn.Module):
    def __init__(self, input_dim):
        super(My_Model, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, 192),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(192, 1),
        )

    def forward(self, x):
        return self.layers(x).squeeze(1)  # (B, 1) -> (B), as in the template

Final scores: private 0.90417, public 0.85241. A pity the private score missed the boss baseline, but further tuning along these lines seems unlikely to help (in fact, the lowest private score I ever reached was about 0.89, still well short of 0.86).

ml2022spring_hw1.py
# -*- coding: utf-8 -*-
"""ML2022Spring - HW1.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1FTcG6CE-HILnvFztEFKdauMlPKfQvm5Z

# **Homework 1: COVID-19 Cases Prediction (Regression)**

Objectives:
* Solve a regression problem with deep neural networks (DNN).
* Understand basic DNN training tips.
* Familiarize yourself with PyTorch.

If you have any questions, please contact the TAs via TA hours, NTU COOL, or email to mlta-2022-spring@googlegroups.com

# Download data
If the Google Drive links below do not work, you can download data from [Kaggle](https://www.kaggle.com/t/a3ebd5b5542f0f55e828d4f00de8e59a), and upload data manually to the workspace.

!gdown --id '1kLSW_-cW2Huj7bh84YTdimGBOJaODiOS' --output covid.train.csv
!gdown --id '1iiI5qROrAhZn-o4FPqsE97bMzDEFvIdg' --output covid.test.csv
"""


"""# Import packages"""

# Numerical Operations

# Reading/Writing Data

# For Progress Bar

# Pytorch

# For plotting learning curve

"""# Some Utility Functions

You do not need to modify this part.
"""




from sklearn.feature_selection import SelectKBest, f_regression
import math
import numpy as np
import pandas as pd
import os
import csv
from tqdm import tqdm
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, random_split
from torch.utils.tensorboard import SummaryWriter
def same_seed(seed):
    '''Fixes random number generator seeds for reproducibility.'''
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


def train_valid_split(data_set, valid_ratio, seed):
    '''Split provided training data into training set and validation set'''
    valid_set_size = int(valid_ratio * len(data_set))
    train_set_size = len(data_set) - valid_set_size
    train_set, valid_set = random_split(data_set, [
                                        train_set_size, valid_set_size], generator=torch.Generator().manual_seed(seed))
    return np.array(train_set), np.array(valid_set)


def predict(test_loader, model, device):
    model.eval()  # Set your model to evaluation mode.
    preds = []
    for x in tqdm(test_loader):
        x = x.to(device)
        with torch.no_grad():
            pred = model(x)
            preds.append(pred.detach().cpu())
    preds = torch.cat(preds, dim=0).numpy()
    return preds


"""# Dataset"""


class COVID19Dataset(Dataset):
    '''
    x: Features.
    y: Targets, if none, do prediction.
    '''

    def __init__(self, x, y=None):
        if y is None:
            self.y = y
        else:
            self.y = torch.FloatTensor(y)
        self.x = torch.FloatTensor(x)

    def __getitem__(self, idx):
        if self.y is None:
            return self.x[idx]
        else:
            return self.x[idx], self.y[idx]

    def __len__(self):
        return len(self.x)


"""# Neural Network Model
Try out different model architectures by modifying the class below.
"""


class My_Model(nn.Module):
    def __init__(self, input_dim):
        super(My_Model, self).__init__()
        # TODO: modify model's structure, be aware of dimensions.
        self.layers = nn.Sequential(
            nn.Linear(input_dim, 192),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(192, 1),
        )

    def forward(self, x):
        x = self.layers(x)
        x = x.squeeze(1)  # (B, 1) -> (B)
        return x


"""# Feature Selection
Choose features you deem useful by modifying the function below.
"""


def split_data(data):
    p = np.array([0, 1, 2, 3, 15, 16, 17, 18, 19, 31, 32, 33, 34, 35])
    X = [data[:, p + i * 16 + 38] for i in range(3)]
    Y = [data[:, (i+3)*16 + 38 - 1] for i in range(3)]
    X = np.vstack(X)
    Y = np.hstack(Y)
    return X, Y


def select_feat(train_data, valid_data, test_data, select_all=True):
    '''Selects useful features to perform regression'''
    raw_x_train, y_train = split_data(train_data)
    raw_x_valid, y_valid = split_data(valid_data)
    raw_x_test = test_data[:, np.array(
        [0, 1, 2, 3, 15, 16, 17, 18, 19, 31, 32, 33, 34, 35]) + 38 + 16 * 2]
    return raw_x_train, raw_x_valid, raw_x_test, y_train, y_valid


"""# Training Loop"""


def trainer(train_loader, valid_loader, model, config, device):

    # Define your loss function, do not modify this.
    criterion = nn.MSELoss(reduction='mean')

    # Define your optimization algorithm.
    # TODO: Please check https://pytorch.org/docs/stable/optim.html to get more available algorithms.
    # TODO: L2 regularization (optimizer(weight decay...) or implement by your self).
    # optimizer = torch.optim.SGD(model.parameters(), lr=config['learning_rate'], momentum=0.9, weight_decay=1e-3)
    optimizer = torch.optim.Adam(model.parameters(), weight_decay=5e-5)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer,
                                                                     T_0=2, T_mult=3, eta_min=0)
    writer = SummaryWriter()  # Writer of tensorboard.

    if not os.path.isdir('./models'):
        os.mkdir('./models')  # Create directory of saving models.

    n_epochs, best_loss, step, early_stop_count = config['n_epochs'], math.inf, 0, 0

    for epoch in range(n_epochs):
        model.train()  # Set your model to train mode.
        loss_record = []

        # tqdm is a package to visualize your training progress.
        train_pbar = tqdm(train_loader, position=0, leave=True)

        for x, y in train_pbar:
            optimizer.zero_grad()               # Set gradient to zero.
            x, y = x.to(device), y.to(device)   # Move your data to device.
            pred = model(x)
            loss = criterion(pred, y)
            # Compute gradient(backpropagation).
            loss.backward()
            optimizer.step()                    # Update parameters.
            step += 1
            loss_record.append(loss.detach().item())

            # Display current epoch number and loss on tqdm progress bar.
            train_pbar.set_description(f'Epoch [{epoch+1}/{n_epochs}]')
            train_pbar.set_postfix({'loss': loss.detach().item()})
        scheduler.step()
        mean_train_loss = sum(loss_record)/len(loss_record)
        writer.add_scalar('Loss/train', mean_train_loss, step)

        model.eval()  # Set your model to evaluation mode.
        loss_record = []
        for x, y in valid_loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                pred = model(x)
                loss = criterion(pred, y)

            loss_record.append(loss.item())

        mean_valid_loss = sum(loss_record)/len(loss_record)
        print(
            f'Epoch [{epoch+1}/{n_epochs}]: Train loss: {mean_train_loss:.4f}, Valid loss: {mean_valid_loss:.4f}')
        writer.add_scalar('Loss/valid', mean_valid_loss, step)

        if mean_valid_loss < best_loss:
            best_loss = mean_valid_loss
            # Save your best model
            torch.save(model.state_dict(), config['save_path'])
            print('Saving model with loss {:.3f}...'.format(best_loss))
            early_stop_count = 0
        else:
            early_stop_count += 1

        if early_stop_count >= config['early_stop']:
            print('\nModel is not improving, so we halt the training session.')
            return


"""# Configurations
`config` contains hyper-parameters for training and the path to save your model.
"""

device = 'cuda' if torch.cuda.is_available() else 'cpu'
config = {
    'seed': 19260817,      # Your seed number, you can pick your lucky number. :)
    'select_all': False,   # Whether to use all features.
    'valid_ratio': 0.2,   # validation_size = train_size * valid_ratio
    'n_epochs': 5000,     # Number of epochs.
    'batch_size': 192,
    'learning_rate': 1e-5,
    # If model has not improved for this many consecutive epochs, stop training.
    'early_stop': 500,
    'save_path': './models/model.ckpt'  # Your model will be saved here.
}

"""# Dataloader
Read data from files and set up training, validation, and testing sets. You do not need to modify this part.
"""

# Set seed for reproducibility
same_seed(config['seed'])


# train_data size: 2699 x 118 (id + 37 states + 16 features x 5 days)
# test_data size: 1078 x 117 (without last day's positive rate)
train_data, test_data = pd.read_csv(
    './covid.train.csv').values, pd.read_csv('./covid.test.csv').values
train_data, valid_data = train_valid_split(
    train_data, config['valid_ratio'], config['seed'])

# Print out the data size.
print(f"""train_data size: {train_data.shape}
valid_data size: {valid_data.shape}
test_data size: {test_data.shape}""")

# Select features
x_train, x_valid, x_test, y_train, y_valid = select_feat(
    train_data, valid_data, test_data, config['select_all'])

# Print out the number of features.
print(f'number of features: {x_train.shape[1]}')

train_dataset, valid_dataset, test_dataset = COVID19Dataset(x_train, y_train), \
    COVID19Dataset(x_valid, y_valid), \
    COVID19Dataset(x_test)

# Pytorch data loader loads pytorch dataset into batches.
train_loader = DataLoader(
    train_dataset, batch_size=config['batch_size'], shuffle=True, pin_memory=True)
valid_loader = DataLoader(
    valid_dataset, batch_size=config['batch_size'], shuffle=True, pin_memory=True)
test_loader = DataLoader(
    test_dataset, batch_size=config['batch_size'], shuffle=False, pin_memory=True)

"""# Start training!"""
# put your model and data on the same computation device.
model = My_Model(input_dim=x_train.shape[1]).to(device)
trainer(train_loader, valid_loader, model, config, device)

"""# Plot learning curves with `tensorboard` (optional)

`tensorboard` is a tool that allows you to visualize your training progress.

If this block does not display your learning curve, please wait for few minutes, and re-run this block. It might take some time to load your logging information.
"""

# Commented out IPython magic to ensure Python compatibility.
# %reload_ext tensorboard
# %tensorboard --logdir=./runs/

"""# Testing
The predictions of your model on testing set will be stored at `pred.csv`.
"""


def save_pred(preds, file):
    ''' Save predictions to specified file '''
    with open(file, 'w') as fp:
        writer = csv.writer(fp)
        writer.writerow(['id', 'tested_positive'])
        for i, p in enumerate(preds):
            writer.writerow([i, p])


model = My_Model(input_dim=x_train.shape[1]).to(device)
model.load_state_dict(torch.load(config['save_path']))
preds = predict(test_loader, model, device)
save_pred(preds, 'pred.csv')

"""# Reference
This notebook uses code written by Heng-Jui Chang @ NTUEE (https://github.com/ga642381/ML2021-Spring/blob/main/HW01/HW01.ipynb)
"""

HW2 Classification: LibriSpeech Phoneme Classification

The task is to predict phonemes from speech. The preprocessing that extracts MFCC features from the raw audio has already been done by the TAs; all we need to do is design the network and train it.

The conventional approach is a plain deep feedforward classifier: stack up layers of a given depth and width, add dropout and batch normalization, and finish with a fully connected layer that outputs the class probabilities. A bit of tuning is enough to reach the strong baseline.

The following hyperparameters achieved a private score of 0.75771 and a public score of 0.75427 on Kaggle:

concat_nframes = 31
hidden_layers = 11
hidden_dim = 2048
dropout_rate = 0.15
batchnorm = True
train_ratio = 0.9
batch_size = 1024 
num_epoch = 10
learning_rate = 0.0001

To reach the boss baseline, though, an RNN is necessary.

Before using an RNN, the input format has to change: instead of flattening the concat_nframes frames together, the shape goes from (batch_size, concat_nframes * 39) to (batch_size, concat_nframes, 39).
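
A minimal sketch of the reshape (the full script below does this per sample, in LibriDataset.__getitem__, via view(-1, 39)):

x = x.view(-1, concat_nframes, 39)  # (batch_size, concat_nframes * 39) -> (batch_size, concat_nframes, 39)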

But across two days of repeated attempts, I found that a plain BiLSTM performs poorly; no matter how I tuned it, accuracy would not even reach 0.6.

After some searching online, I found that combining an LSTM with a CRF (conditional random field) works very well: with just one epoch of training, validation accuracy exceeded 0.75, and after a few more epochs it approached 0.8.

A rough explanation: the LSTM only learns the contextual relations among the features, while adding the CRF also learns the contextual relations among the labels.
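
A minimal sketch of the joint training step, using the pytorch-crf package (bilstm, features, and labels are as in the full script below). The LSTM produces per-frame emission scores; the CRF adds learned label-transition scores, and training maximizes the log-likelihood of the whole label sequence:

from torchcrf import CRF

crf = CRF(num_tags=41, batch_first=True)
emissions = bilstm(features)      # (batch, concat_nframes, 41) emission scores
loss = -crf(emissions, labels)    # negative log-likelihood of the label sequence
loss.backward()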

After multiple rounds of tuning (training is painfully slow: a single run took well over half an hour on a 4090 rented from AutoDL, and nearly twice as long on my local 4070), I finally reached the boss baseline, with a private score of 0.8283 and a public score of 0.8271 on Kaggle. The hyperparameters and network structure are below.

class BiLSTM(nn.Module):
    def __init__(self, class_size=41, input_dim=39, hidden_dim=256, dropout=0.35):
        super().__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.class_size = class_size
        self.lstm = nn.LSTM(input_dim, hidden_dim // 2, dropout=dropout,
                            num_layers=6, bidirectional=True, batch_first=True)
        self.hidden2tag = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, class_size)
        )
        
    def forward(self, x):
        feats, _ = self.lstm(x)
        return self.hidden2tag(feats)
    

concat_nframes = 29
hidden_layers = 6
hidden_dim = 256
dropout_rate = 0.35
batchnorm = True
train_ratio = 0.9
batch_size = 512
num_epoch = 50  # in practice it converged after about 10 epochs
learning_rate1 = 0.0001 * 25
learning_rate2 = 0.0001 * 750  # the CRF gets a much larger learning rate

The reference code is adapted from the CSDN article 李宏毅2022机器学习HW2解析 (a walkthrough of Hung-yi Lee's 2022 Machine Learning HW2).

ml2022spring_hw2.py
import gc
import numpy as np
from torchcrf import CRF
import torch.nn.functional as F
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
import os
import random
import pandas as pd
import torch
from tqdm import tqdm


def load_feat(path):
    feat = torch.load(path)
    return feat


def shift(x, n):
    if n < 0:
        left = x[0].repeat(-n, 1)
        right = x[:n]

    elif n > 0:
        right = x[-1].repeat(n, 1)
        left = x[n:]
    else:
        return x

    return torch.cat((left, right), dim=0)


def concat_feat(x, concat_n):
    assert concat_n % 2 == 1  # n must be odd
    if concat_n < 2:
        return x
    seq_len, feature_dim = x.size(0), x.size(1)
    x = x.repeat(1, concat_n)
    x = x.view(seq_len, concat_n, feature_dim).permute(
        1, 0, 2)  # concat_n, seq_len, feature_dim
    mid = (concat_n // 2)
    for r_idx in range(1, mid+1):
        x[mid + r_idx, :] = shift(x[mid + r_idx], r_idx)
        x[mid - r_idx, :] = shift(x[mid - r_idx], -r_idx)

    return x.permute(1, 0, 2).view(seq_len, concat_n * feature_dim)


def preprocess_data(split, feat_dir, phone_path, concat_nframes, train_ratio=0.8, train_val_seed=1337):
    class_num = 41  # NOTE: pre-computed, should not need change
    mode = 'train' if (split == 'train' or split == 'val') else 'test'

    label_dict = {}
    if mode != 'test':
        phone_file = open(os.path.join(
            phone_path, f'{mode}_labels.txt')).readlines()

        for line in phone_file:
            line = line.strip('\n').split(' ')
            label_dict[line[0]] = [int(p) for p in line[1:]]

    if split == 'train' or split == 'val':
        # split training and validation data
        usage_list = open(os.path.join(
            phone_path, 'train_split.txt')).readlines()
        random.seed(train_val_seed)
        random.shuffle(usage_list)
        percent = int(len(usage_list) * train_ratio)
        usage_list = usage_list[:percent] if split == 'train' else usage_list[percent:]
    elif split == 'test':
        usage_list = open(os.path.join(
            phone_path, 'test_split.txt')).readlines()
    else:
        raise ValueError(
            'Invalid \'split\' argument for dataset: PhoneDataset!')

    usage_list = [line.strip('\n') for line in usage_list]
    print('[Dataset] - # phone classes: ' + str(class_num) +
          ', number of utterances for ' + split + ': ' + str(len(usage_list)))

    max_len = 3000000
    X = torch.empty(max_len, 39 * concat_nframes)
    if mode != 'test':
        y = torch.empty(max_len, concat_nframes, dtype=torch.long)

    idx = 0
    for i, fname in tqdm(enumerate(usage_list)):
        feat = load_feat(os.path.join(feat_dir, mode, f'{fname}.pt'))
        cur_len = len(feat)
        feat = concat_feat(feat, concat_nframes)
        if mode != 'test':
            label = torch.LongTensor(label_dict[fname]).unsqueeze(1)
            label = concat_feat(label, concat_nframes)

        X[idx: idx + cur_len, :] = feat
        if mode != 'test':
            y[idx: idx + cur_len] = label

        idx += cur_len

    X = X[:idx, :]
    if mode != 'test':
        y = y[:idx]

    print(f'[INFO] {split} set')
    print(X.shape)
    if mode != 'test':
        print(y.shape)
        return X, y
    else:
        return X


class LibriDataset(Dataset):
    def __init__(self, X, y=None):
        self.data = X
        if y is not None:
            self.label = torch.LongTensor(y)
        else:
            self.label = None

    def __getitem__(self, idx):
        if self.label is not None:
            return self.data[idx].view(-1, 39), self.label[idx]
        else:
            return self.data[idx].view(-1, 39)

    def __len__(self):
        return len(self.data)


class BiLSTM(nn.Module):
    def __init__(self, class_size=41, input_dim=39, hidden_dim=256, dropout=0.35):
        super().__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.class_size = class_size
        self.lstm = nn.LSTM(input_dim, hidden_dim // 2, dropout=dropout,
                            num_layers=6, bidirectional=True, batch_first=True)
        self.hidden2tag = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, class_size)
        )

    def forward(self, x):
        feats, _ = self.lstm(x)
        return self.hidden2tag(feats)


class Crf(nn.Module):
    def __init__(self, class_size=41):
        super().__init__()
        self.class_size = class_size
        self.crf = CRF(self.class_size, batch_first=True)

    def likelihood(self, x, y):
        return self.crf(x, y)

    def forward(self, x):
        return torch.tensor(self.crf.decode(x), dtype=torch.long, device=x.device)


# data parameters
# the number of frames to concat with, n must be odd (total 2k+1 = n frames)
concat_nframes = 29
mid = concat_nframes//2
# the ratio of data used for training, the rest will be used for validation
train_ratio = 0.9

# training parameters
seed = 0                        # random seed
batch_size = 512                # batch size
num_epoch = 50                   # the number of training epochs
early_stopping = 8
learning_rate = 0.0001  # learning rate
model1_path = './model1.ckpt'     # the path where the checkpoint will be saved
model2_path = './model2.ckpt'
# model parameters
# the input dim of the model, you should not change the value
input_dim = 39 * concat_nframes
hidden_layers = 3              # the number of hidden layers
hidden_dim = 1024              # the hidden dim

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
print(f'DEVICE: {device}')


# fix seed

def same_seeds(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True


# preprocess data
train_X, train_y = preprocess_data(split='train', feat_dir='./libriphone/feat',
                                   phone_path='./libriphone', concat_nframes=concat_nframes, train_ratio=train_ratio)
val_X, val_y = preprocess_data(split='val', feat_dir='./libriphone/feat',
                               phone_path='./libriphone', concat_nframes=concat_nframes, train_ratio=train_ratio)

# get dataset
train_set = LibriDataset(train_X, train_y)
val_set = LibriDataset(val_X, val_y)

# remove raw feature to save memory
del train_X, train_y, val_X, val_y
gc.collect()

# get dataloader
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_set, batch_size=batch_size, shuffle=False)

# fix random seed
same_seeds(seed)

# create model, define a loss function, and optimizer
bilstm = BiLSTM().to(device)
crf = Crf().to(device)
optimizer1 = torch.optim.AdamW(
    bilstm.parameters(), lr=learning_rate*25, weight_decay=0.015)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer1,
                                                                 T_0=8, T_mult=2, eta_min=learning_rate/2)

optimizer2 = torch.optim.AdamW(
    crf.parameters(), lr=learning_rate*750, weight_decay=1e-7)

total_num = 0
for i, param in enumerate(bilstm.parameters()):
    print('Layer:', i, '    parameter num:',
          param.numel(), '    shape:', param.shape)
    total_num += param.numel()

print(f'Total parameters num: {total_num}')

total_num = 0
for i, param in enumerate(crf.parameters()):
    print('Layer:', i, '    parameter num:',
          param.numel(), '    shape:', param.shape)
    total_num += param.numel()

print(f'Total parameters num: {total_num}')

best_acc = 0.0
early_stop_count = 0
for epoch in range(num_epoch):
    train_acc = 0.0
    train_loss = 0.0
    val_acc = 0.0
    val_loss = 0.0
    train_item = 0
    # training
    bilstm.train()  # set the model to training mode
    crf.train()
    pbar = tqdm(train_loader, ncols=110)
    pbar.set_description(f'T: {epoch+1}/{num_epoch}')
    samples = 0
    for i, batch in enumerate(pbar):
        features, labels = batch
        features, labels = features.to(device), labels.to(device)

        optimizer1.zero_grad()
        optimizer2.zero_grad()
        loss = -crf.likelihood(bilstm(features), labels)
        loss.backward()
        grad_norm = nn.utils.clip_grad_norm_(bilstm.parameters(), max_norm=50)
        optimizer1.step()
        optimizer2.step()

        train_loss += loss.item()
        train_item += labels.size(0)

        lr1 = optimizer1.param_groups[0]["lr"]
        lr2 = optimizer2.param_groups[0]["lr"]
        pbar.set_postfix(
            {'lr1': lr1, 'lr2': lr2, 'loss': train_loss/train_item})
    scheduler.step()
    pbar.close()
    # validation
    if len(val_set) > 0:
        bilstm.eval()  # set the model to evaluation mode
        crf.eval()
        with torch.no_grad():
            pbar = tqdm(val_loader, ncols=110)
            pbar.set_description(f'V: {epoch+1}/{num_epoch}')
            samples = 0
            for i, batch in enumerate(pbar):
                features, labels = batch
                features, labels = features.to(device), labels.to(device)
                outputs = crf(bilstm(features))
                val_acc += (outputs[:, mid] == labels[:, mid]).sum().item()
                samples += labels.size(0)
                pbar.set_postfix({'val acc': val_acc/samples})
            pbar.close()
            # if the model improves, save a checkpoint at this epoch
        if val_acc > best_acc:
            best_acc = val_acc
            torch.save(bilstm.state_dict(), model1_path)
            torch.save(crf.state_dict(), model2_path)
            print('saving model with acc {:.3f}'.format(
                best_acc/(len(val_set))))
            early_stop_count = 0
        else:
            early_stop_count += 1
            if early_stop_count >= early_stopping:
                print(
                    f"Epoch: {epoch + 1}, model not improving, early stopping.")
                break

del train_loader, val_loader
gc.collect()

# load data
test_X = preprocess_data(split='test', feat_dir='./libriphone/feat',
                         phone_path='./libriphone', concat_nframes=concat_nframes)

test_set = LibriDataset(test_X)

del test_X
gc.collect()

# get dataloader
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)

# load model
# model = Classifier(input_dim=input_dim, hidden_layers=hidden_layers, hidden_dim=hidden_dim).to(device)
bilstm = BiLSTM().to(device)
bilstm.load_state_dict(torch.load(model1_path))

crf = Crf().to(device)
crf.load_state_dict(torch.load(model2_path))

pred = np.array([], dtype=np.int32)

bilstm.eval()
crf.eval()
with torch.no_grad():
    for features in tqdm(test_loader):
        features = features.to(device)
        outputs = crf(bilstm(features))
        pred = np.concatenate((pred, outputs.detach().cpu()[:, mid]), axis=0)

with open('prediction.csv', 'w') as f:
    f.write('Id,Class\n')
    for i, y in enumerate(pred):
        f.write('{},{}\n'.format(i, y))

HW3 CNN: Food Classification

The task is to classify food images into categories.

This one didn't feel worth much effort, so instead of fiddling I went straight for a pretrained model: resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).
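
Concretely, this means loading the ImageNet weights and swapping the final fully connected layer for an 11-way food classifier, exactly as in the full script below:

from torchvision import models
import torch.nn as nn

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 11)  # 11 food categories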

Training hit a bizarre problem: running the TA-provided .ipynb made CUDA error out, yet converting exactly the same code to a .py file ran without issue.

Reference code:

ml2022spring_hw3.py
import pandas as pd
import torch
import os
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torchvision import models
from torchvision.transforms import v2
import torch.nn as nn
from PIL import Image
from tqdm import tqdm
import numpy as np

seed = 0
batch_size = 64
learning_rate = 0.001
num_epoch = 10
num_classes = 11
early_stopping = 8
model_path = 'model.ckpt'


class FoodDataset(Dataset):
    def __init__(self, path, X, y=None, transform=None):
        self.files = X
        self.path = path
        self.transform = transform
        if y is not None:
            self.label = y
        else:
            self.label = None

    def __getitem__(self, idx):
        data = Image.open(os.path.join(self.path, self.files[idx]))
        if self.label is not None:
            return self.files[idx], self.transform(data), self.label[idx]
        else:
            return self.files[idx], self.transform(data)

    def __len__(self):
        return len(self.files)


def load_data(path, mode, transform=None):
    X, y = [], []
    for file in os.listdir(path):
        if file.endswith('.jpg'):
            X.append(file)
            if mode in ['train', 'val']:
                y.append(int(file.split('_')[0]))

    if mode in ['train', 'val']:
        return FoodDataset(path, X, y, transform=transform)
    else:
        return FoodDataset(path, X, transform=transform)


def same_seeds(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True


same_seeds(seed)

transform = v2.Compose([
    v2.ToImage(),
    v2.Resize((512, 512), antialias=True),
    v2.RandomHorizontalFlip(),
    v2.RandomVerticalFlip(),
    v2.RandomRotation(15),
    v2.RandomPerspective(),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

transform2 = v2.Compose([
    v2.ToImage(),
    v2.Resize((512, 512), antialias=True),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

train_set = load_data('./food11/training/', mode='train', transform=transform)
val_set = load_data('./food11/validation/', mode='val', transform=transform)
test_set = load_data('./food11/test/', mode='test', transform=transform2)

print('train dataset: {} images'.format(len(train_set)))
print('val dataset: {} images'.format(len(val_set)))
print('test dataset: {} images'.format(len(test_set)))

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print('device:', device)

train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_set, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)

model = models.resnet50(
    weights=models.ResNet50_Weights.IMAGENET1K_V2).to(device)
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, num_classes).to(device)

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=8, T_mult=2, eta_min=learning_rate / 10)

best_acc = 0.0
early_stop_count = 0
for epoch in range(num_epoch):
    train_acc = 0.0
    train_loss = 0.0
    val_acc = 0.0
    val_loss = 0.0
    train_item = 0

    # training
    model.train()  # set the model to training mode
    pbar = tqdm(train_loader, ncols=110)
    pbar.set_description(f'T: {epoch+1}/{num_epoch}')
    for i, batch in enumerate(pbar):
        _, features, labels = batch
        features = features.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        outputs = model(features)

        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # get the index of the class with the highest probability
        _, train_pred = torch.max(outputs, 1)
        train_acc += (train_pred.detach() == labels.detach()).sum().item()
        train_loss += loss.item()
        train_item += labels.size(0)
        lr = optimizer.param_groups[0]["lr"]
        pbar.set_postfix(
            {'lr': lr, 'loss': train_loss/train_item})
    scheduler.step()
    pbar.close()
    # validation
    if len(val_set) > 0:
        model.eval()  # set the model to evaluation mode
        with torch.no_grad():
            pbar = tqdm(val_loader, ncols=110)
            pbar.set_description(f'V: {epoch+1}/{num_epoch}')
            samples = 0
            for i, batch in enumerate(pbar):
                _, features, labels = batch
                features = features.to(device)
                labels = labels.to(device)
                outputs = model(features)

                loss = criterion(outputs, labels)

                _, val_pred = torch.max(outputs, 1)
                # get the index of the class with the highest probability
                val_acc += (val_pred.cpu() == labels.cpu()).sum().item()
                val_loss += loss.item()
                samples += labels.size(0)
                pbar.set_postfix({'val acc': val_acc/samples})
            pbar.close()

            print('[{:03d}/{:03d}] Train Acc: {:3.6f} Loss: {:3.6f} | Val Acc: {:3.6f} loss: {:3.6f}'.format(
                epoch + 1, num_epoch, train_acc/len(train_set), train_loss/len(
                    train_loader), val_acc/len(val_set), val_loss/len(val_loader)
            ))

            # if the model improves, save a checkpoint at this epoch
            if val_acc > best_acc:
                best_acc = val_acc
                torch.save(model.state_dict(), model_path)
                print('saving model with acc {:.3f}'.format(
                    best_acc/(len(val_set))))
                early_stop_count = 0
            else:
                early_stop_count += 1
                if early_stop_count >= early_stopping:
                    print(
                        f"Epoch: {epoch + 1}, model not improving, early stopping.")
                    break
    else:
        print('[{:03d}/{:03d}] Train Acc: {:3.6f} Loss: {:3.6f}'.format(epoch + 1,
              num_epoch, train_acc / len(train_set), train_loss/len(train_loader)))

    # if not validating, save the last epoch
if len(val_set) == 0:
    torch.save(model.state_dict(), model_path)
    print('saving model at last epoch')

result = []

model.eval()
with torch.no_grad():
    for i, batch in enumerate(tqdm(test_loader)):
        filename, features = batch
        features = features.to(device)
        outputs = model(features)

        # get the index of the class with the highest probability
        _, test_pred = torch.max(outputs, 1)
        # result = [(filename, pred_label)]
        result.extend(zip(filename, test_pred.cpu().numpy()))

with open('result.txt', 'w') as f:
    for filename, pred_label in result:
        f.write('{}\t{}\n'.format(filename, pred_label))