Building Tweet Classification Models with BERT 🤗

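Text classification is probably the first type of task you run into when getting started with NLP. The traditional approach vectorizes text with word counts, TF-IDF, or Word2vec and then feeds the vectors to a classifier such as SVM or XGBoost. With the arrival of the Transformer architecture, people began using models like BERT for text classification and got good results. Thanks to the growth of the open-source community, running inference with BERT or training it is no longer difficult. Today's implementation uses Hugging Face's transformers package.
This tutorial implements text classification with BERT. The dataset comes from the public Kaggle competition "Natural Language Processing with Disaster Tweets", whose goal is to predict from tweets posted on X whether the content describes a disaster.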

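🤗 In this post you will learn:

How to load a variety of large language models from Hugging Face
What transformers, Tokenizer, and Dataset do and how to use them
How to fine-tune a model with the Trainer API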

Requirements

💻 Dataset: Kaggle Link
Code: GitHub Link

Package versions I used:
🐍 python 3.7+
🐍 torch 2.1.1
🤗 transformers 4.28.0
🤗 datasets 2.13.1

I ran training directly in a Colab environment, where the packages can simply be installed with pip:

!pip install transformers
!pip install datasets evaluate accelerate

Data Description

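The dataset contains the tweet id, keyword, location, text, and target. text is the tweet content and is the main field we use; target is the class label to predict, either 0 (not disaster) or 1 (disaster).

Data cleaning is an important part of any task. I'd guess data scientists spend roughly 80% of their time on plumbing work like data cleaning and data engineering. Understanding the data, spotting anomalies, and learning its distribution takes a great deal of time. To keep this post short, the code and the cleaned dataset are provided on GitHub. The cleaning mainly follows "NLP with Disaster Tweets - EDA, Cleaning and BERT", whose author studied and cleaned the data thoroughly, including expanding contractions, fixing garbled characters, and correcting mislabeled samples. I collected the patterns the author used with re.sub into clean_mapping.xlsx and put the processed data in the repo for direct download. The rest of this post uses the cleaned data.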

import pandas as pd
df_train = pd.read_excel('./data/df_train.xlsx')
df_test = pd.read_excel('./data/df_text.xlsx')

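Training BERT

Hugging Face & DistilBERT Introduction

Although I say BERT, the model I actually use is DistilBERT, which Hugging Face proposed in a 2019 paper. Its architecture is similar to BERT's, and it uses knowledge distillation to shrink the model, speeding up training while retaining most of the accuracy. See the paper if you are interested in the details. In transformers, the only difference between using the two models is the checkpoint you specify.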

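The fine-tuning steps are roughly as follows:

Convert the data to the Dataset format
Create a Tokenizer and convert the data to input_ids
Define the evaluation metrics for training with datasets.load_metric
Create a Trainer and start fine-tuning the model
Export the saved model and tokenizer checkpoint

The models used here come from Hugging Face's Models. You can search the 🤗 Hugging Face Hub directly for the model you want, for example the one used today: distilbert/distilbert-base-uncased.

The model card shows an introduction to the model and how to download and use it, for example: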

from transformers import pipeline
unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
unmasker("Hello I'm a [MASK] model.")

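On the right, the Inference API offers a demo of the model. If you want to know which models are available, click Models at the top of the home page and filter by Tasks on the left, for example Text Classification or Question Answering under Natural Language Processing.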

Data Preprocessing：Tokenization & Dataset

Tokenizer

🤗 Transformers Tokenizer

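Let's start with the relatively important concept of tokenization.
A tokenizer's main job is to turn text into a sequence of numbers, for example turning 「我愛你」 ("I love you") into [2057, 1014, 1012], where 2057 is the index of 「我」, 1014 the index of 「愛」, and 1012 the index of 「你」. This involves two steps: 1. splitting the text into tokens, and 2. converting the tokens into a sequence of numbers.

There are many ways to split text. In English you can use split() to cut on whitespace (word-based), while Chinese text is commonly segmented with jieba.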

tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)
>['Jim', 'Henson', 'was', 'a', 'puppeteer']

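After collecting all the tokens, we build a vocabulary that assigns a unique ID to every possible token so the model can identify each one. The vocabulary also contains special tokens that the model needs, such as [UNK] for words that are not in the vocabulary, as well as [CLS] and [SEP]. We call these IDs, or the pieces of text they stand for, tokens. Character-based and subword-based splitting are other common techniques, and LLMs more often use methods such as the byte-level BPE of GPT-2, the WordPiece of BERT, or SentencePiece.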
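To make the vocabulary idea concrete, here is a minimal word-based sketch in plain Python. It is only an illustration: build_vocab and encode are hypothetical helpers, not part of transformers, and real tokenizers ship a pre-built vocabulary rather than building one from your data.

```python
from collections import Counter

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]

def build_vocab(corpus):
    """Assign an integer ID to every whitespace-separated word in the corpus."""
    counts = Counter(word for sentence in corpus for word in sentence.lower().split())
    # Special tokens come first so their IDs stay stable; words follow by frequency.
    tokens = SPECIAL_TOKENS + [word for word, _ in counts.most_common()]
    return {token: idx for idx, token in enumerate(tokens)}

def encode(sentence, vocab):
    """Map a sentence to IDs, falling back to [UNK] for unseen words."""
    words = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
    return [vocab.get(word, vocab["[UNK]"]) for word in words]

corpus = ["Jim Henson was a puppeteer", "Jim was a writer"]
vocab = build_vocab(corpus)
ids = encode("Jim was a dancer", vocab)  # "dancer" is not in the vocabulary
print(ids)
```

The word "dancer" never appears in the toy corpus, so it is mapped to the ID of [UNK], which is exactly the information loss that subword methods try to reduce.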

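As you can imagine from the above, different models can easily use different tokenizers depending on their training data, language, and so on. In transformers we therefore specify the checkpoint of the model we want to load, for example a BERT checkpoint released by Google, and load the matching model weights and tokenizer together. This downloads the vocabulary mentioned above along with the other required files.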
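As a rough illustration of why the vocabulary and the tokenizer have to match, the sketch below shows the greedy longest-match-first lookup that WordPiece-style tokenizers apply to a single word. The toy vocabulary is hypothetical and far smaller than a real one, and this covers only the lookup step, not how the vocabulary is learned.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into sub-words using greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"earth", "##quake", "##s", "quake", "fire"}  # hypothetical vocabulary
print(wordpiece_tokenize("earthquakes", toy_vocab))
print(wordpiece_tokenize("flood", toy_vocab))
```

With a different vocabulary the same word would be split differently, which is why the tokenizer is always loaded from the same checkpoint as the model.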

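In transformers, there are two ways to load a tokenizer:

Name the tokenizer class explicitly: from transformers import DistilBertTokenizer
Use AutoTokenizer, which detects and loads the tokenizer that matches the checkpoint
We pass the checkpoint to AutoTokenizer by calling .from_pretrained(). If the checkpoint is a model name, it is downloaded from Hugging Face; it can also point to a local model path.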

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased" # huggingface checkpoint
# checkpoint = "./model/your_local_checkpoint_path" # local checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

The tokenizer offers the following methods for tokenization:
.tokenize() splits text into tokens
.convert_tokens_to_ids() converts tokens into a sequence of IDs
.decode() converts a sequence of IDs back into text

text = df_train['text_cleaned'].iloc[0]
tokens = tokenizer.tokenize(text)
print(f"tokens: {tokens}")

ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"token ids: {ids}")

decoded_string = tokenizer.decode(ids)
print(f"decoded: {decoded_string}")

>tokens: ['our', 'deeds', 'are', 'the', 'reason', 'of', 'this', '#', 'earthquake', 'may', 'allah', 'forgive', 'us', 'all']
>token ids: [2256, 15616, 2024, 1996, 3114, 1997, 2023, 1001, 8372, 2089, 16455, 9641, 2149, 2035]
>decoded: our deeds are the reason of this # earthquake may allah forgive us all

The tokenizer also accepts padding, truncation, and max_length arguments; see Padding and truncation for details.
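As a rough picture of what padding, truncation, and max_length do, here is a plain-Python sketch. pad_and_truncate is a hypothetical helper, not the tokenizer's implementation; it assumes a pad ID of 0 and that the last ID of each sequence is the closing [SEP] token, which is kept when truncating.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Return fixed-length input_ids and attention_mask for a batch of ID lists."""
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        if len(ids) > max_length:
            # Truncation: keep the first max_length - 1 IDs plus the closing token.
            ids = ids[:max_length - 1] + ids[-1:]
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # padding up to max_length
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 marks padded positions
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [[101, 2256, 15616, 102], [101, 2256, 15616, 2024, 1996, 3114, 102]]
encoded = pad_and_truncate(batch, max_length=6)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The attention mask is what lets the model ignore the padded positions, so padding changes the tensor shape without changing what the model attends to.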

Finally, we call the instantiated tokenizer directly to tokenize (recommended).

result = tokenizer(text,
                   return_tensors="pt")
print(f'result: {result}')

print(tokenizer.decode(result['input_ids'][0]))

>result: {'input_ids': tensor([[  101,  2256, 15616,  2024,  1996,  3114,  1997,  2023,  1001,  8372,
          2089, 16455,  9641,  2149,  2035,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
>[CLS] our deeds are the reason of this # earthquake may allah forgive us all [SEP]

With that, tokenization is done. Next we wrap it in a function so we can tokenize the whole dataset.

Dataset

🤗 Transformers Datasets
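Just as PyTorch uses torch.utils.data.Dataset to wrap data and torch.utils.data.DataLoader to define how each batch is sampled, the Hugging Face ecosystem uses the 🤗 Datasets library to define how data is loaded and to wrap it in the datasets format. Backed by the Apache Arrow format, Dataset handles large datasets quickly and efficiently.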

What we want is to convert the loaded pd.DataFrame into a Dataset and tokenize all of the data,
so we need to:

Define a tokenize function
Convert the pd.DataFrame to a Dataset, then tokenize quickly with map and batched=True.

from datasets import Dataset
from sklearn.model_selection import train_test_split

def do_tokenizer(batch):
  # With batched=True, batch is a dict of lists, one list per column
  return tokenizer(batch["text_cleaned"],
                   truncation=True,
                   )

# train / dev split (the test set was loaded separately)
df_train, df_dev = train_test_split(df_train, test_size=0.2, random_state=42)

# transform to dataset
ds_train = Dataset.from_pandas(df_train)
ds_dev = Dataset.from_pandas(df_dev)
ds_test = Dataset.from_pandas(df_test)

ds_train = ds_train.map(do_tokenizer, batched=True)
ds_dev = ds_dev.map(do_tokenizer, batched=True)
ds_test = ds_test.map(do_tokenizer, batched=True)
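To see why the function passed to map receives lists when batched=True, here is a toy stand-in written in plain Python. batched_map and toy_tokenize are hypothetical and only mimic the idea of handing the function a dict of columns; the real Dataset.map works on Arrow tables and does far more.

```python
def batched_map(rows, fn, batch_size=1000):
    """Toy stand-in for Dataset.map(fn, batched=True) on a list of row dicts."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Rows are turned into columns: {"col": [v1, v2, ...]}.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_cols = fn(batch)  # fn sees lists, so it can process many rows at once
        for i, row in enumerate(chunk):
            out.append({**row, **{key: values[i] for key, values in new_cols.items()}})
    return out

def toy_tokenize(batch):
    # Hypothetical tokenizer: one "ID" per word, here simply the word length.
    return {"input_ids": [[len(w) for w in text.split()] for text in batch["text_cleaned"]]}

rows = [{"text_cleaned": "forest fire near town"}, {"text_cleaned": "all good"}]
mapped = batched_map(rows, toy_tokenize, batch_size=2)
print(mapped)
```

Handing the tokenizer a whole list of texts at once is what makes the batched call faster than tokenizing row by row.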

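Now the text column has been converted by the tokenizer into input_ids sequences. The last thing to do is pad the sequences to the same length. Below we list the input_ids lengths of the first few rows, and you can see that the sequences differ in length. The idea is to pad the shorter sequences up to the length of the longest one, a technique called padding.

[len(input_ids) for input_ids in ds_train[:10]['input_ids']]

>[16, 12, 27, 14, 22, 29, 20, 21, 15, 16, 11]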

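The padding method we use is dynamic padding, set up through DataCollatorWithPadding.
Why dynamic padding? Padding every sequence to the longest one in the whole dataset can waste a lot of memory and compute. All we really need is for the sequences within each batch to have the same length. DataCollatorWithPadding therefore pads when a batch is drawn, which speeds up training. (Note: this approach can cause problems on TPUs, which prefer fixed-length inputs.)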

from transformers import DataCollatorWithPadding

# dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
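A back-of-the-envelope sketch of the saving from dynamic padding: the function below counts how many token slots are processed when every sequence is padded to the global maximum versus the maximum of its own batch. The lengths and batch size are made up for illustration, and padded_token_count is a hypothetical helper.

```python
def padded_token_count(lengths, batch_size, dynamic):
    """Count how many token slots the model processes in one pass over the data."""
    global_max = max(lengths)
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Dynamic padding pads to the longest sequence in this batch only.
        target = max(batch) if dynamic else global_max
        total += target * len(batch)
    return total

lengths = [16, 12, 11, 14, 22, 29, 20, 21]  # hypothetical sequence lengths
static_cost = padded_token_count(lengths, batch_size=4, dynamic=False)
dynamic_cost = padded_token_count(lengths, batch_size=4, dynamic=True)
print(static_cost, dynamic_cost)
```

The gap grows when a few very long tweets sit in a dataset of mostly short ones, since only the batches containing them pay for the extra length.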

Last of all, we remove the columns we don't need, rename the label column, and define the label names, and then we are ready to start training!

col2remove = ['id', 'keyword', 'location', 'text']

ds_train = ds_train.remove_columns(col2remove)
ds_dev = ds_dev.remove_columns(col2remove)
ds_test = ds_test.remove_columns(col2remove)

ds_train = ds_train.rename_column("target_relabeled", "label")
ds_dev = ds_dev.rename_column("target_relabeled", "label")

id2label = {0: "NOT",
            1: "YES"}
label2id = {v: k for k, v in id2label.items()}

Fine tune with Trainer API

🤗 Transformers Trainer API

transformers provides Trainer for fine-tuning models. With just a few setup steps you can call Trainer.train() to start training without writing your own training loop.

How to download models

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import load_metric

model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2,
    id2label=id2label,
    label2id=label2id
)

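Before training we load the model's pretrained weights. As with the tokenizer, we specify the checkpoint and download it with .from_pretrained(). The difference is that, although we still use an AutoClass,
we pick a different AutoClass for each task. Ours is a classification task, so we use AutoModelForSequenceClassification.

After loading, the model architecture is adapted to the task. For classification, a classification layer sized by the number of classes is added at the end, and the weights of these task-specific layers are randomly initialized, as the warning shown after loading the model says: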

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
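To illustrate why the warning matters, here is a small plain-Python sketch of a randomly initialized classification layer. It is a simplification under stated assumptions: a single linear layer rather than DistilBERT's pre_classifier plus classifier, a made-up 8-dimensional feature vector, and an assumed initialization standard deviation of 0.02.

```python
import math
import random

def softmax(logits):
    """Convert raw scores to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def init_head(hidden_size, num_labels, seed=0):
    """Randomly initialise a linear layer, like a freshly added classifier."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)] for _ in range(num_labels)]
    bias = [0.0] * num_labels
    return weight, bias

def classify(features, weight, bias, id2label):
    """Apply the linear layer, then return the top label and all probabilities."""
    logits = [sum(w * x for w, x in zip(row, features)) + b for row, b in zip(weight, bias)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs

id2label = {0: "NOT", 1: "YES"}
weight, bias = init_head(hidden_size=8, num_labels=2)
features = [0.5] * 8  # hypothetical pooled sentence representation
label, probs = classify(features, weight, bias, id2label)
print(label, probs)
```

With untrained weights the two probabilities stay close to 0.5, so the prediction is essentially a coin flip until the head is fine-tuned.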

If you are interested, you can see which other AutoClasses are available here.

Where the weights are saved

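Downloaded weights are stored by default in a cache folder under ~/.cache/huggingface/ (the hub/ subfolder in recent versions of transformers, transformers/ in older ones). Calling .from_pretrained() again loads from that folder first and only downloads from the network when the files are not found. This is why the weights must be downloaded again if a Colab session disconnects. To change the default location, set the HF_HOME environment variable.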
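As a simplified picture of how HF_HOME changes the cache location, the sketch below derives the cache folder from a dict of environment variables. resolve_hub_cache is a hypothetical helper that assumes the layout of recent releases, where files live in a hub subfolder; the real libraries also honor other variables, and the Drive path is only an example.

```python
import os

def resolve_hub_cache(env):
    """Mimic how the cache directory is derived from environment variables."""
    default_home = os.path.join(os.path.expanduser("~"), ".cache", "huggingface")
    hf_home = env.get("HF_HOME", default_home)  # HF_HOME overrides the default root
    return os.path.join(hf_home, "hub")

print(resolve_hub_cache(os.environ))
print(resolve_hub_cache({"HF_HOME": "/content/drive/MyDrive/hf"}))  # hypothetical path
```

On Colab, pointing HF_HOME at mounted persistent storage before importing transformers is one way to avoid downloading the weights again after a disconnect.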

How to save the model

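Saving a model mirrors loading one: call .save_pretrained() and pass the directory where the model should be stored.

output_dir = "directory_on_my_computer"
model.save_pretrained(output_dir)

Two files are saved: 1. config.json, which holds the model configuration, and 2. pytorch_model.bin, which holds the model weights (newer versions of transformers write model.safetensors instead by default).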

Setting the Trainer arguments

TrainingArguments

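Before creating the Trainer we define TrainingArguments, which holds the settings that control training. The only required argument is output_dir, the path where checkpoints are saved. The rest can stay at their defaults or be tuned to improve the model. Some commonly used arguments:

seed: the random seed, so results can be reproduced
learning_rate: the learning rate
per_device_train_batch_size: the training batch size per GPU
eval_steps: how many steps between evaluations
save_steps: how many steps between checkpoint saves
evaluation_strategy: when to evaluate, either steps or epoch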

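Note that this is different from early stopping. With load_best_model_at_end=True, training runs to the end, and the Trainer then picks the best model among the checkpoints saved at each evaluation (every eval_steps steps, or every epoch) and loads it automatically.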
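The difference between picking the best checkpoint afterwards and stopping early can be sketched in plain Python. Both helpers and the evaluation log below are hypothetical and only illustrate the logic; they assume a lower eval_loss is better.

```python
def best_checkpoint(history, metric="eval_loss", greater_is_better=False):
    """Pick the saved step whose evaluation metric is best."""
    chooser = max if greater_is_better else min
    return chooser(history, key=lambda record: record[metric])["step"]

def early_stop_step(history, patience, metric="eval_loss"):
    """Return the step at which training would stop, or None if it never triggers."""
    best, bad_evals = float("inf"), 0
    for record in history:
        if record[metric] < best:
            best, bad_evals = record[metric], 0
        else:
            bad_evals += 1  # no improvement at this evaluation
            if bad_evals >= patience:
                return record["step"]
    return None

# Hypothetical evaluation log, one entry per evaluation/save step.
history = [
    {"step": 200, "eval_loss": 0.45},
    {"step": 400, "eval_loss": 0.41},
    {"step": 600, "eval_loss": 0.44},
    {"step": 800, "eval_loss": 0.47},
]
print(best_checkpoint(history))              # selection after the full run
print(early_stop_step(history, patience=2))  # where early stopping would halt
```

Selecting the best checkpoint always uses the full training budget and looks back afterwards, while early stopping cuts training short once the metric stops improving.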

More details on TrainingArguments can be found here.

training_args = TrainingArguments(
    output_dir = './model/',
    learning_rate=2e-5,
    seed=11,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=4,
    weight_decay=0.01,
    eval_steps=600,
    save_steps=600,
    evaluation_strategy="steps",
    save_strategy="steps",
    load_best_model_at_end=True,
)
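To get a feel for how eval_steps and save_steps relate to the length of training, the arithmetic can be sketched as follows. training_schedule is a hypothetical helper, the figure of 6,000 training rows is an assumption for illustration, and the calculation assumes a single device with no gradient accumulation.

```python
import math

def training_schedule(num_examples, batch_size, num_epochs, eval_steps):
    """Derive step counts for single-device training without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be smaller
    total_steps = steps_per_epoch * num_epochs
    eval_points = list(range(eval_steps, total_steps + 1, eval_steps))
    return steps_per_epoch, total_steps, eval_points

# 6,000 training rows is a hypothetical figure; use len(ds_train) for the real one.
steps_per_epoch, total_steps, eval_points = training_schedule(
    num_examples=6000, batch_size=16, num_epochs=4, eval_steps=600
)
print(steps_per_epoch, total_steps, eval_points)
```

If the number of evaluation points comes out very small, lowering eval_steps and save_steps gives load_best_model_at_end more checkpoints to choose from.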

Compute Metrics

During training we need evaluation metrics to tell us how the model is doing. Datasets provides the common NLP metrics, and list_metrics() shows which ones are available.

from datasets import list_metrics
metrics = list_metrics()
print(metrics)
>['accuracy', 'bertscore', 'bleu', 'bleurt', 'cer', 
'comet', 'coval', 'cuad', 'f1', 'gleu', 'glue', 'indic_glue', 
'matthews_correlation', 'meteor', 'pearsonr', 'precision', 
'recall', 'rouge', 'sacrebleu', 'sari', 'seqeval', 'spearmanr', 
'squad', 'squad_v2', 'super_glue', 'wer', 'wiki_split', 'xnli']

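We can load a metric from the Hub with load_metric(), either directly as metric = load_metric('accuracy') or inside a custom compute function. Before that we need to know what a metric returns; datasets.MetricInfo has more details.

A metric uses its .compute() method to score predictions against references and returns a dict whose keys are metric names and whose values are the computed scores. If you write a custom compute function, remember to return a dict as well.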

import numpy as np

# Load the metrics once instead of on every evaluation call
load_acc = load_metric('accuracy')
load_f1 = load_metric('f1')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # class with the highest logit
    acc = load_acc.compute(predictions=predictions, references=labels)['accuracy']
    f1 = load_f1.compute(predictions=predictions, references=labels)['f1']
    return {'acc': acc, 'f1': f1}
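For readers who want to see what the two metrics compute, here is a dependency-free sketch of accuracy and binary F1 that returns the same kind of dict. accuracy_and_f1 is a hypothetical helper and the logits and labels are made up; it is not the implementation behind load_metric.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Compute accuracy and binary F1 and return them as a dict."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive and r == positive for p, r in pairs)  # true positives
    fp = sum(p == positive and r != positive for p, r in pairs)  # false positives
    fn = sum(p != positive and r == positive for p, r in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"acc": correct / len(pairs), "f1": f1}

logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.8]]  # hypothetical model outputs
labels = [0, 1, 1, 1]
predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]  # argmax per row
scores = accuracy_and_f1(predictions, labels)
print(scores)
```

F1 is worth tracking next to accuracy here because it reflects how well the disaster class itself is detected, not just the overall hit rate.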

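With the steps above, the preparation for training is complete, including:

Tokenizing with the tokenizer
Converting the data to a Dataset
Defining compute_metrics to calculate the metrics
Defining TrainingArguments to set the training parameters

All that is left is to instantiate the Trainer and pass in the data and arguments to start training!

Last but not least, Train!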

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_train,
    eval_dataset=ds_dev,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

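trainer.train()

Fine-tuning now begins! During training, the model is evaluated on the dev set and saved at the step interval we configured, so we can see how it is performing.

pred = trainer.predict(ds_dev)
pred = np.argmax(pred.predictions, axis=-1)

Once training is done, we can call .predict() on the Trainer and pass in the dataset to predict on to get the model's predictions.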

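That completes fine-tuning BERT for a classification task 🎉

This post only walks through the overall flow. For more details, see the official Hugging Face documentation, which is very thorough and includes tutorials for each module. Highly recommended!
The official tutorials are here!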

Reference

🤗 Hugging Face NLP Course
NLP with Disaster Tweets - EDA, Cleaning and BERT