Commit f0e4d5ff authored by Guillaume Slizewicz

# First hours with CamemBERT
This post is an introduction to working with CamemBERT, following the Hugging Face quickstart guide, in order to do tokenisation and masked-word prediction.
You can find the huggingface quickstart guide here: https://huggingface.co/transformers/quickstart.html
## Tokenisation
It took us a few minutes, and a deep dive into the documentation, to understand how to transpose the BERT quickstart guide to CamemBERT.
The first things to change are the imports:
```python
import torch
from transformers import CamembertConfig, CamembertForTokenClassification, CamembertTokenizer, CamembertForMaskedLM
```
```python
# OPTIONAL: if you want to have more information on what's happening under the hood, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)
```
The name of the pre-trained model changes as well:
```python
# Load pre-trained model tokenizer (vocabulary)
tokenizer = CamembertTokenizer.from_pretrained('camembert-base')
```
INFO:transformers.tokenization_utils:loading file https://s3.amazonaws.com/models.huggingface.co/bert/camembert-base-sentencepiece.bpe.model from cache at /Users/u0122145/.cache/torch/transformers/3715e3a4a2de48834619b2a6f48979e13ddff5cabfb1f3409db689f9ce3bb98f.28d30f926f545047fc59da64289371eef0fbdc0764ce9ec56f808a646fcfec59
The following part took us a while to crack. First of all, the special tokens for CamemBERT are derived from RoBERTa, so they differ from BERT's: `'[CLS]'` becomes `'<s>'`, `'[SEP]'` becomes `'</s>'`, and `'[MASK]'` becomes `'<mask>'`.
What's more, CamemBERT's SentencePiece tokenizer puts another special character at the beginning of each token that starts a word. It looks like an underscore but is not one: it is ▁ (U+2581).
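As a minimal sketch of the two points above, the snippet below rewrites BERT-style special tokens into the CamemBERT ones and checks that the word-start marker is not a plain underscore. The names `BERT_TO_CAMEMBERT`, `WORD_START`, `to_camembert_specials` and `detokenize` are our own illustrative helpers, not part of transformers.

```python
import unicodedata

# Hypothetical helper table (not part of transformers): BERT special
# tokens mapped to the CamemBERT/RoBERTa-style equivalents.
BERT_TO_CAMEMBERT = {
    "[CLS]": "<s>",
    "[SEP]": "</s>",
    "[MASK]": "<mask>",
}

# The word-start marker is U+2581, not the ASCII underscore U+005F.
WORD_START = "\u2581"


def to_camembert_specials(text):
    """Replace each BERT special token in `text` with its CamemBERT equivalent."""
    for bert_token, camembert_token in BERT_TO_CAMEMBERT.items():
        text = text.replace(bert_token, camembert_token)
    return text


def detokenize(tokens):
    """Join tokens back into text, turning each word-start marker into a space."""
    return "".join(tokens).replace(WORD_START, " ").strip()


print(to_camembert_specials("[CLS] Qui est Jean ? [SEP]"))
print(unicodedata.name(WORD_START), "| same as underscore:", WORD_START == "_")
print(detokenize(["▁Qui", "▁est", "▁Jean", "▁?"]))
```

Comparing against `WORD_START` rather than a typed `_` avoids the silent mismatch that cost us time here.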
```python
text = '<s> Qui est Jean ? </s> Jean était un homme politique</s>'
tokenized_text = tokenizer.tokenize(text)
print(tokenized_text)
assert tokenized_text == ['<s>', '▁Qui', '▁est', '▁Jean', '▁?', '</s>', '▁Jean', '▁était', '▁un', '▁homme', '▁politique', '</s>']
```
['<s>', '▁Qui', '▁est', '▁Jean', '▁?', '</s>', '▁Jean', '▁était', '▁un', '▁homme', '▁politique', '</s>']
```python
masked_index = 6
tokenized_text[masked_index] = '<mask>'
print(tokenized_text)
assert tokenized_text == ['<s>', '▁Qui', '▁est', '▁Jean', '▁?', '</s>', '<mask>', '▁était', '▁un', '▁homme', '▁politique', '</s>']
```
['<s>', '▁Qui', '▁est', '▁Jean', '▁?', '</s>', '<mask>', '▁était', '▁un', '▁homme', '▁politique', '</s>']
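Hardcoding `masked_index = 6` only works for this exact sentence. A possible alternative, sketched here with a hypothetical helper `mask_occurrence` of our own (it is not a transformers function), is to look up the position of the token to hide:

```python
def mask_occurrence(tokens, target, occurrence=1, mask_token="<mask>"):
    """Return (masked_tokens, index) with the n-th occurrence of `target` masked.

    The input list is left untouched; a copy is returned.
    """
    seen = 0
    for index, token in enumerate(tokens):
        if token == target:
            seen += 1
            if seen == occurrence:
                masked = list(tokens)
                masked[index] = mask_token
                return masked, index
    raise ValueError(f"{target!r} does not occur {occurrence} time(s)")


tokens = ['<s>', '▁Qui', '▁est', '▁Jean', '▁?', '</s>',
          '▁Jean', '▁était', '▁un', '▁homme', '▁politique', '</s>']

# Mask the second '▁Jean', i.e. the one in the second sentence.
masked_tokens, found_index = mask_occurrence(tokens, '▁Jean', occurrence=2)
print(found_index, masked_tokens)
```

Note that the target has to include the ▁ prefix, since that is how the token appears in the tokenizer output.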
```python
# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
```
In the following snippet, be careful to adapt the segment ids as well: the list must contain exactly one id per token (12 here).
```python
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
print(indexed_tokens, len(indexed_tokens), len(segments_ids))
```
```
[5, 1470, 30, 470, 106, 6, 32004, 149, 23, 421, 462, 6] 12 12
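Rather than counting zeros and ones by hand, the segment ids can be derived from the token list. The following is a rough sketch, assuming the first `</s>` closes sentence A and everything after it belongs to sentence B; `segment_ids_from_tokens` is a hypothetical helper of ours, not a library call.

```python
def segment_ids_from_tokens(tokens, sep_token="</s>"):
    """Assign 0 to every token up to and including the first separator, 1 after."""
    segment_ids = []
    current_segment = 0
    for token in tokens:
        segment_ids.append(current_segment)
        # Switch to sentence B only once, right after the first separator.
        if token == sep_token and current_segment == 0:
            current_segment = 1
    return segment_ids


example_tokens = ['<s>', '▁Qui', '▁est', '▁Jean', '▁?', '</s>',
                  '<mask>', '▁était', '▁un', '▁homme', '▁politique', '</s>']
example_segments = segment_ids_from_tokens(example_tokens)
print(example_segments, len(example_segments) == len(example_tokens))
```

Because the list is built token by token, its length always matches the number of tokens.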
```python
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
labels = torch.tensor([1] * tokens_tensor.size(1)).unsqueeze(0)
```
## Prediction
Be sure to change the model class here as well:
```python
# Load pre-trained model (weights)
model = CamembertForMaskedLM.from_pretrained('camembert-base')
model.eval()
```
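The call syntax also differs slightly from the BERT example, because the arguments are not the same.
We do not pass `token_type_ids`; we pass `masked_lm_labels` instead, so the model returns the loss first and the prediction scores second (hence `outputs[1]` below). Later releases of transformers renamed this argument to `labels`.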
```python
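# Run the forward pass without tracking gradients (inference only)
with torch.no_grad():
    outputs = model(tokens_tensor, masked_lm_labels=tokens_tensor)
print(outputs)
# With labels supplied, outputs[0] is the loss and outputs[1] the prediction scores
predictions = outputs[1]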
```
```python
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
```
```python
print(predictions)
```
```python
print(predicted_token)
```
▁Jean
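`argmax` only shows the single best candidate, while looking at the few highest-scoring tokens is often more telling. The sketch below illustrates the idea on a plain Python list of made-up scores so that it runs without the model; `top_k` and `toy_scores` are illustrative names of ours, and the values are not real model output.

```python
def top_k(scores, k=3):
    """Return the indices of the k highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Made-up scores for a five-entry vocabulary (illustration only).
toy_scores = [0.1, 2.5, -1.0, 3.2, 0.7]
best_indices = top_k(toy_scores, k=3)
print(best_indices)  # [3, 1, 4]
```

On the real `predictions` tensor, `torch.topk(predictions[0, masked_index], k)` should give the equivalent values and indices, and the indices can then be passed to `tokenizer.convert_ids_to_tokens` as above.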
## Thoughts
Hugging Face's transformers is a wonderful tool for doing NLP, and it makes exploring models much easier than finding and downloading each one by hand. Still, their claim that "tokenizer and base model’s API are standardized to easily switch between models." is not yet fully realised: switching between models still requires understanding how the models differ, and opening many tabs of documentation and model descriptions. What's more, many of the terms used can be a little frightening when you start using these models, and they take time to recognise and understand.