# First hours with CamemBERT
This post is an introduction to working with CamemBERT, using the Hugging Face quickstart guide as a starting point, for tokenisation and masked-word prediction.
You can find the Hugging Face quickstart guide here:
## Tokenisation
It took us a few minutes, and a deep dive into the documentation, to understand how to transpose the BERT quickstart guide to CamemBERT.
The first things to change are the imports:
import torch
from transformers import CamembertConfig, CamembertForTokenClassification, CamembertTokenizer, CamembertForMaskedLM
# OPTIONAL: if you want more information on what's happening under the hood, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)
The name of the pre-trained model changes too:
# Load pre-trained model tokenizer (vocabulary)
tokenizer = CamembertTokenizer.from_pretrained('camembert-base')
INFO:transformers.tokenization_utils:loading file from cache at /Users/u0122145/.cache/torch/transformers/3715e3a4a2de48834619b2a6f48979e13ddff5cabfb1f3409db689f9ce3bb98f.28d30f926f545047fc59da64289371eef0fbdc0764ce9ec56f808a646fcfec59
The following part took us a while to crack. First of all, the special tokens for CamemBERT are derived from RoBERTa, so they differ from the ones used by BERT:
`'[CLS]'` becomes `'<s>'`

`'[SEP]'` becomes `'</s>'`

`'[MASK]'` becomes `'<mask>'`
What's more, CamemBERT's SentencePiece tokenizer puts another special character at the beginning of each token that starts a word. It looks like an underscore but is not one: it is ▁ (Unicode U+2581).
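The three special-token substitutions listed above can be kept in a small lookup table. The sketch below is only an illustration: `BERT_TO_CAMEMBERT` and `to_camembert_markup` are names made up for this post, not part of transformers, and the code assumes the special tokens appear literally in the text.

```python
# Hypothetical helper (not part of transformers): rewrite BERT-style
# special tokens in a string to their CamemBERT equivalents.
BERT_TO_CAMEMBERT = {
    "[CLS]": "<s>",
    "[SEP]": "</s>",
    "[MASK]": "<mask>",
}


def to_camembert_markup(text: str) -> str:
    """Replace each BERT special token with the CamemBERT one."""
    for bert_token, camembert_token in BERT_TO_CAMEMBERT.items():
        text = text.replace(bert_token, camembert_token)
    return text


bert_style = "[CLS] Qui est Jean ? [SEP] Jean était un homme politique [SEP]"
print(to_camembert_markup(bert_style))
```

This only rewrites the markers in the raw string; the actual tokenisation is still done by the tokenizer.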
text = '<s> Qui est Jean ? </s> Jean était un homme politique</s>'
tokenized_text = tokenizer.tokenize(text)
assert tokenized_text == ['<s>', '▁Qui', '▁est', '▁Jean', '▁?', '</s>', '▁Jean', '▁était', '▁un', '▁homme', '▁politique', '</s>']
['<s>', '▁Qui', '▁est', '▁Jean', '▁?', '</s>', '▁Jean', '▁était', '▁un', '▁homme', '▁politique', '</s>']
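To convince yourself that the prefix in the tokens above is not an ASCII underscore, you can inspect it with the standard library. The sketch below also glues a few tokens back into text; `detokenize` is a hypothetical helper written for this post, and it assumes the prefix only ever marks the start of a word.

```python
import unicodedata

WORD_START = "\u2581"  # the word-start prefix; it is not the ASCII "_"

# Show the code point and its official Unicode name.
print(hex(ord(WORD_START)), unicodedata.name(WORD_START))
print(WORD_START == "_")


def detokenize(tokens):
    """Hypothetical helper: join tokens and turn the prefix back into spaces."""
    return "".join(tokens).replace(WORD_START, " ").strip()


# Rebuild the first four word tokens of the example sentence.
tokens = [WORD_START + word for word in ("Qui", "est", "Jean", "?")]
print(detokenize(tokens))
```

Copying the character from the tokenizer output, or writing it as `"\u2581"`, is safer than trying to type it.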
masked_index = 6
tokenized_text[masked_index] = '<mask>'
assert tokenized_text == ['<s>', '▁Qui', '▁est', '▁Jean', '▁?', '</s>', '<mask>', '▁était', '▁un', '▁homme', '▁politique', '</s>']
['<s>', '▁Qui', '▁est', '▁Jean', '▁?', '</s>', '<mask>', '▁était', '▁un', '▁homme', '▁politique', '</s>']
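Hard-coding `masked_index = 6` works for this sentence only. One possible alternative, sketched below, is to look the position up; `find_nth` is a hypothetical helper, and the token list is copied from the tokenizer output shown earlier.

```python
def find_nth(tokens, target, n):
    """Return the index of the n-th occurrence (1-based) of target in tokens."""
    seen = 0
    for index, token in enumerate(tokens):
        if token == target:
            seen += 1
            if seen == n:
                return index
    raise ValueError(f"{target!r} occurs fewer than {n} times")


tokens = ['<s>', '▁Qui', '▁est', '▁Jean', '▁?', '</s>',
          '▁Jean', '▁était', '▁un', '▁homme', '▁politique', '</s>']

# Mask the second "Jean", i.e. the one in the answer sentence.
masked_index = find_nth(tokens, '▁Jean', 2)
tokens[masked_index] = '<mask>'
print(masked_index, tokens[masked_index])
```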
# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
In the following snippet, be careful to adapt the segment ids as well: the list must contain exactly one id per token (12 here).
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
print(indexed_tokens, len(indexed_tokens), len(segments_ids))
[5, 1470, 30, 470, 106, 6, 32004, 149, 23, 421, 462, 6] 12 12
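Writing the segment ids by hand makes it easy to end up with a list of the wrong length. One way to derive them from the tokens, assuming the first `</s>` closes sentence A, is sketched below; `build_segment_ids` is a hypothetical helper, not a transformers function.

```python
def build_segment_ids(tokens, sep_token="</s>"):
    """Give 0 to every token up to and including the first separator, 1 after it."""
    first_sep = tokens.index(sep_token)
    return [0 if index <= first_sep else 1 for index in range(len(tokens))]


tokens = ['<s>', '▁Qui', '▁est', '▁Jean', '▁?', '</s>',
          '<mask>', '▁était', '▁un', '▁homme', '▁politique', '</s>']

segments_ids = build_segment_ids(tokens)
# The two lengths must match before building tensors.
print(segments_ids, len(tokens) == len(segments_ids))
```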
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
labels = torch.tensor([1] * tokens_tensor.size(1)).unsqueeze(0)
## Prediction
Be sure to change the model class here as well:
# Load pre-trained model (weights)
model = CamembertForMaskedLM.from_pretrained('camembert-base')
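Here the call differs a little from the BERT guide as well, because the arguments are not the same.
We no longer pass `token_type_ids`; we pass `masked_lm_labels` instead. The two are not equivalent: `masked_lm_labels` supplies the labels used to compute the loss, so the model returns the loss first and the prediction scores second, which is why we read `outputs[1]`. In more recent versions of transformers this argument is called `labels`.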
with torch.no_grad():
    # Labels are passed, so outputs[0] is the loss and outputs[1] holds the prediction scores
    outputs = model(tokens_tensor, masked_lm_labels=tokens_tensor)
    predictions = outputs[1]
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
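`torch.argmax` keeps only the single best candidate for the masked position. To look at several candidates, it helps to turn the scores into probabilities and sort them. The sketch below does this in plain Python on made-up scores for a made-up four-token vocabulary; the numbers are not CamemBERT outputs, and `softmax` and `top_k` are hypothetical helpers written for this post.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def top_k(scores, vocabulary, k):
    """Return the k (token, probability) pairs with the highest probability."""
    pairs = zip(vocabulary, softmax(scores))
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy values for illustration only, NOT real model output.
toy_vocabulary = ['▁Jean', '▁Il', '▁Paris', '▁homme']
toy_scores = [4.0, 2.5, 0.5, 1.0]

for token, probability in top_k(toy_scores, toy_vocabulary, k=2):
    print(token, round(probability, 3))
```

With the real tensors, `torch.topk(predictions[0, masked_index], k)` should give the best `k` scores and their vocabulary indices, which can then be passed to `tokenizer.convert_ids_to_tokens`.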
## Thoughts
Even though Hugging Face's transformers is a wonderful tool for doing NLP, and makes exploring models much easier than finding and downloading each one by hand, its claim that "tokenizer and base model’s API are standardized to easily switch between models." is not yet fully met. Switching between models still requires an understanding of how the models differ, and the opening of many tabs and model descriptions. What's more, many of the terms used can be a little intimidating when you start using these models, and they take time to recognise and understand.
