1 Transformer models

 

1.1 For which tasks?

  • Classifying whole sentences:
    • Examples: Sentiment analysis, spam detection, grammatical correctness, sentence relationship.
  • Classifying each word in a sentence:
    • Examples: Grammatical components (noun, verb, adjective), named entity recognition (person, location, organization).
  • Generating text content:
    • Examples: Autocomplete text from a prompt, fill in masked words in a text.
  • Extracting an answer from a text:
    • Examples: Given a question and context, extract the answer based on the context.
  • Generating a new sentence from an input text:
    • Examples: Translation, text summarization.

 

1.1.1 A challenging task

  • Human vs Machine Language Processing:
    • Humans understand sentences like “I am hungry” easily, and can compare similar sentences like “I am hungry” and “I am sad”.
    • Machine learning models find it more difficult to process and understand language, requiring careful text processing.
  • Introduction of Transformers:
    • 2017: Introduction of Transformers by Vaswani et al. in the paper “Attention Is All You Need.”
      • Neural network architecture learning context and relationships from sequential data, initially focused on translation tasks.
  • Influential Transformer Models:
    • June 2018: GPT - First pretrained transformer model, state-of-the-art results on various NLP tasks.
    • October 2018: BERT - Designed to produce better sentence summaries.
    • February 2019: GPT-2 - Improved version of GPT, not immediately released due to ethical concerns.
    • October 2019: DistilBERT - A distilled version of BERT, 60% faster, 40% lighter, retaining 97% of BERT’s performance.
    • October 2019: BART and T5 - Pretrained models using the original Transformer architecture.
    • May 2020: GPT-3 - Larger model, capable of zero-shot learning.
    • November 2022: GPT-3.5 and March 2023: GPT-4.
    • Followed by other models like Llama, Claude, Gemini, Mistral.

 

2 Pretraining and Fine-tuning

  • Transformer models (GPT, BERT, BART, T5, etc.) are trained as language models on large amounts of raw text using self-supervised learning.
    • Self-supervised learning: the model learns without human-labeled data, as the objective is computed automatically from the inputs.
    • These models develop a statistical understanding of language but aren’t directly useful for specific tasks.
  • Transfer learning: the process where a pretrained model is fine-tuned on specific tasks using supervised learning (human-annotated labels).
    • Pretrained models undergo fine-tuning to adapt to particular tasks.
  • Pretraining:
    • The model is trained from scratch with randomly initialized weights.
    • Pretraining is resource-intensive in terms of time, data, and money.
    • It requires a large corpus and can take weeks to complete.
  • Fine-tuning:
    • Performed after pretraining using a dataset specific to a task.
    • Reasons to fine-tune instead of training from scratch:
      • The pretrained model was already trained on data that shares similarities with the fine-tuning dataset.
      • Less data and time are required to achieve good results.
      • Lower costs in terms of time, data, financial, and environmental resources.
      • It allows for faster iteration and refinement.
  • Transfer learning advantages:
    • Leverages knowledge from pretraining for improved task-specific performance.
    • Fine-tuning achieves better results than training from scratch unless massive data is available.
    • Using pretrained models close to the target task optimizes performance.

 

3 Transformers Library

  • The 🤗 Transformers library provides tools to create and use shared models.

  • Model Hub:

    • Contains thousands of pretrained models for download and use.
    • Users can also upload their own models to the Hub.
  • Pipeline() function:

    • The most basic object in the library.
    • Connects a model with necessary preprocessing and postprocessing steps.
    • Allows direct input of text for intelligible output.
  • Three main steps in a pipeline:

    1. Text is preprocessed into a format the model understands.
    2. Preprocessed inputs are passed to the model.
    3. Model predictions are post-processed for easy interpretation.
  • Available pipelines:

    • feature-extraction (vector representation of a text)
    • fill-mask
    • ner (named entity recognition)
    • question-answering
    • sentiment-analysis
    • summarization
    • text-generation
    • translation
    • zero-shot-classification
import os
hf_token = "your-key"
custom_cache_dir = '/Users/peltouz/Documents/pretrain'

os.environ['HF_HOME'] = custom_cache_dir  # Hugging Face home directory for all HF operations
os.environ['TRANSFORMERS_CACHE'] = custom_cache_dir  # Transformers-specific cache directory
os.environ['HF_DATASETS_CACHE'] = custom_cache_dir  # Datasets-specific cache directory
os.environ['HF_METRICS_CACHE'] = custom_cache_dir  # Metrics-specific cache directory
os.environ['HF_TOKEN'] = hf_token  # Hugging Face API token

This Python code snippet configures the environment to specify custom cache directories for operations involving Hugging Face libraries, and it sets an API token for authentication.

  • API Token Configuration:
    • hf_token: Assigned to a string representing the Hugging Face API token.
    • Purpose: Used for authenticating and accessing Hugging Face services that require credentials.
  • Custom Cache Directory:
    • custom_cache_dir: Set to /Users/peltouz/Documents/pretrain in this example.
    • Purpose: Specifies a base directory for storing cache files related to Hugging Face operations.
  • Environment Variables:
    • HF_HOME: Specifies the base directory for all Hugging Face-related operations.
    • TRANSFORMERS_CACHE: Directory for caching models downloaded via the transformers library.
    • HF_DATASETS_CACHE: Directory for caching datasets accessed via the datasets library.
    • HF_METRICS_CACHE: Directory for caching metrics-related files used in model evaluation.
    • HF_TOKEN: Environment variable for the Hugging Face API token to authenticate requests to Hugging Face services.
  • Purpose of Environment Variables:
    • Ensure all data related to models, datasets, and metrics are stored in the specified directory (/Users/peltouz/Documents/pretrain).
    • Help manage disk space, especially when handling large models or datasets.
    • Correctly configure credentials for accessing restricted services.

 

4 The Pipeline function

 

4.1 Sentiment Analysis

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
sentiments = classifier(
        ["I hate teaching",
         "I love programming"]
    )

print(sentiments)
## [{'label': 'NEGATIVE', 'score': 0.9989008903503418}, {'label': 'POSITIVE', 'score': 0.9998173117637634}]
sentiments[0]['label']
## 'NEGATIVE'
  • A pretrained model fine-tuned for sentiment analysis in English is selected by default.
  • The model is downloaded and cached when the classifier object is created.
  • Upon rerunning the command, the cached model is used without needing to download again.

 

4.2 Zero-shot classification

classifier = pipeline("zero-shot-classification")
classifier(
    "df %>% filter(!is.na(var1))",
    candidate_labels=["python", "Rstudio"],
)
## {'sequence': 'df %>% filter(!is.na(var1))', 'labels': ['Rstudio', 'python'], 'scores': [0.6543467044830322, 0.345653235912323]}
  • Main idea:
    • Classifying unlabelled texts is a challenging task often encountered in real-world projects.
  • Comparison:
    • Annotating text manually is time-consuming and requires domain expertise.
  • Key aspect:
    • The zero-shot-classification pipeline is powerful because it lets you specify your own labels for classification, instead of relying on pretrained model labels (see the sketch below).
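A minimal sketch reusing the classifier created above: the label set is entirely up to you, and the optional multi_label argument scores each candidate label independently (the example sentence and labels here are illustrative).

classifier(
    "The model overfits when the learning rate is too high.",
    candidate_labels=["machine learning", "cooking", "sports"],
    multi_label=True,  # score each label on its own instead of normalizing across labels
)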

 

4.3 Text generation

generator = pipeline("text-generation")
generator("In my programming course in DS2E I will")
## [{'generated_text': "In my programming course in DS2E I will be presenting you with a simple and elegant way to test this idea.\n\nI have already covered a few examples here, so I will present you with a simple test to show you what it does.\n\nLet's start with the idea that you can put this test in your program and run it.\n\nclass SimpleTest(object): def __init__(self, tests): self.tests.add_function() self.tests.add_function() self.tests.execute()\n\nThe test code will return a function.\n\nIn my programming course in DS2E this function will return a function.\n\nclass TestFunction(object): def __init__(self, tests): self.tests.add_function() self.tests.add_function()\n\nThe test code will return a function.\n\nIn my programming course in DS2E it will return a function.\n\nclass SimpleTestProcedure(object): def __init__(self, tests): self.tests.add_function() self.tests.add_function()\n\nThe test code will return a function.\n\nclass SimpleTestFunctionTest(object): def __init__(self, tests): self.tests."}]
  • Main idea:
    • A prompt is provided, and the model auto-completes it by generating the remaining text.
  • Comparison:
    • Similar to the predictive text feature on phones.
  • Key aspect:
    • Text generation involves randomness, so results may vary between runs (see the sketch below).
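Because generation samples from the model's output distribution, you can fix the random seed if you want reproducible runs; a minimal sketch using the set_seed helper from 🤗 Transformers (the prompt and token budget are just examples):

from transformers import set_seed

set_seed(42)  # fix the sampling seed so repeated runs give the same continuation
generator("In my programming course in DS2E I will", max_new_tokens=20)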

 

4.4 Mask filling

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)
## [{'score': 0.19620011746883392, 'token': 30412, 'token_str': ' mathematical', 'sequence': 'This course will teach you all about mathematical models.'}, {'score': 0.04052743315696716, 'token': 38163, 'token_str': ' computational', 'sequence': 'This course will teach you all about computational models.'}]
  • Main idea:
    • The task involves filling in the blanks in a given text.
  • Comparison:
    • Similar to completing sentences in cloze tests or gap-filling exercises.
  • Key aspect:
    • The focus is on providing contextually appropriate words or phrases to complete the text.

 

4.5 Named entity recognition

ner = pipeline("ner", aggregation_strategy="simple")  # group tokens belonging to the same entity
ner("My name is Pierre and I work at BETA in Strasbourg.")
## [{'entity_group': 'PER', 'score': np.float32(0.99918455), 'word': 'Pierre', 'start': 11, 'end': 17}, {'entity_group': 'ORG', 'score': np.float32(0.9977419), 'word': 'BETA', 'start': 32, 'end': 36}, {'entity_group': 'LOC', 'score': np.float32(0.9894748), 'word': 'Strasbourg', 'start': 40, 'end': 50}]
  • Main idea:
    • Named entity recognition (NER) involves identifying parts of the input text that correspond to entities.
  • Key aspect:
    • The entities of interest are typically persons, locations, or organizations.

 

4.6 Question answering

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Pierre and I work at BETA in Strasbourg.",
)
## {'score': 0.5033916071697604, 'start': 32, 'end': 36, 'answer': 'BETA'}

The question-answering pipeline extracts the answer from the provided context, as shown in the example above.

 

4.7 Summarization

summarizer = pipeline("summarization")
summarizer(
    """This paper offers insights into the diffusion and impact of artificial intelligence in science.
More specifically, we show that neural network-based technology meets the essential properties of emerging technologies in the scientific realm.
It is novel, because it shows discontinuous innovations in the originating domain and is put to new uses in many application domains;
it is quick growing, its dimensions being subject to rapid change; it is coherent, because it detaches from its technological parents, and integrates and is accepted in different scientific communities;
and it has a prominent impact on scientific discovery, but a high degree of uncertainty and ambiguity associated with this impact.
Our findings suggest that intelligent machines diffuse in the sciences, reshape the nature of the discovery process and affect the organization of science.
We propose a new conceptual framework that considers artificial intelligence as an emerging general method of invention and, on this basis, derive its policy implications."""
)
## [{'summary_text': ' Neural network-based technology meets the essential properties of emerging technologies in the scientific realm . It is novel, because it shows discontinuous innovations in the originating domain and is put to new uses in many application domains . Researchers propose a new conceptual framework that considers artificial intelligence as an emerging general method of invention .'}]

Summarization is the task of reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text, as in the example above.

 

5 Using a specific model

See this Warning message: “Using a pipeline without specifying a model name and revision in production is not recommended.”

  • Recommendation:
    • Specify the model name and revision when using pipelines in production environments.
  • Actionable step:
    • Choose a specific model from the 1M+ models available on Hugging Face.
  • Resource:
    • Browse the available models on the Hugging Face Model Hub (https://huggingface.co/models).
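As the warning suggests, you can pin both the model name and a revision (a tag or commit hash from the model repository); a minimal, hedged sketch:

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    revision="main",  # replace with a specific commit hash or tag for full reproducibility
)

The examples below pin only the model name: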
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In my programming course in DS2E I will",
    max_length=30,
    num_return_sequences=3,
)
## [{'generated_text': "In my programming course in DS2E I will make a series of useful programming concepts, including the main object class, the class and the class itself.\n\nHere's a list of more useful programming concepts:\n1. SetInterpolation()\n2. SetInterpolation()\n3. SetInterpolation()\n4. SetInterpolation()\n5. SetInterpolation()\n6. SetInterpolation()\n7. SetInterpolation()\n8. SetInterpolation()\n9. SetInterpolation()\n10. SetInterpolation()\n11. SetInterpolation()\n12. SetInterpolation()\n13. SetInterpolation()\n14. SetInterpolation()\n2015. SetInterpolation()\n2015. SetInterpolation()\n2016. SetInterpolation()\n2017. SetInterpolation()\n2018. SetInterpolation()\n2019. SetInterpolation()\n20. SetInterpolation()\n21. SetInterpolation()\n22. SetInterpolation()\n2015. SetInterpolation()\n2016. SetInterpolation()\n2017. SetInterpolation()\n2018. SetInterpolation()\n2019. SetInterpolation"}, {'generated_text': 'In my programming course in DS2E I will also be using the DSP and the DSP. This course is designed to help with DS2E and DS2E: The DSP and DSP.\n\n\n\nIn this course I will also be using the DSP and DSP.\nIn my course I will also be using the DSP and DSP.'}, {'generated_text': "In my programming course in DS2E I will have to wait for the end of his semester in order to work on a new game, so I'll be able to take a few weeks off, so it feels like I've completed everything I've been doing.\n\nI will be working on an upcoming game, Agrins, that I can play on the DS2E. This is my first game with the DS2E. I'll be using the game to play on the DS2E. This is my first game with the DS2E. I will be using the game to play on the DS2E. This is my first game with the DS2E. I will be using the game to play on the DS2E. This is my first game with the DS2E. I will be using the game to play on the DS2E. This is my first game with the DS2E. I will be using the game to play on the DS2E. This is my first game with the DS2E. I will be using the game to play on the DS2E.\nI will be using the game to play on the DS2E. This is my first game with the DS2E. I will be using the game to play on the DS2"}]
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")
## [{'translation_text': 'This course is produced by Hugging Face.'}]

For translation, you can use a default model if you provide a language pair in the task name (such as “translation_en_to_fr”), but the easiest way is to pick the model you want to use on the Model Hub.

Here we translated from French to English using the Helsinki-NLP/opus-mt-fr-en model.

 

6 Bias and limitations

  • Pretrained and fine-tuned models are powerful tools, but they have limitations.

  • The main limitation stems from the nature of pretraining on large datasets.

    • Data is often scraped indiscriminately from the internet.
    • This includes both high-quality and low-quality content.
  • An example is provided with a fill-mask pipeline using the BERT model.

unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])
## ['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])
## ['nurse', 'maid', 'teacher', 'waitress', 'prostitute']

 

7 Behind the pipeline function

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
sentiments = classifier(
        ["I hate teaching",
         "I love programming"]
    )
sentiments
## [{'label': 'NEGATIVE', 'score': 0.9989008903503418}, {'label': 'POSITIVE', 'score': 0.9998173117637634}]

This pipeline groups together three steps: preprocessing, passing the inputs through the model, and postprocessing.

 

7.1 Preprocessing with a tokenizer

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
  • Transformer models can’t process raw text directly.

  • The first step is to convert text inputs into numbers using a tokenizer.

  • The tokenizer is responsible for:

    • Splitting the input into tokens (words, subwords, or symbols like punctuation).
    • Mapping each token to an integer (both steps are sketched after this list).
    • Adding additional inputs useful to the model.
  • Preprocessing must be consistent with how the model was pretrained.

  • AutoTokenizer class and its from_pretrained() method help fetch and cache the tokenizer information.
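A minimal sketch of the first two steps, reusing the tokenizer loaded above (the exact tokens depend on the checkpoint's vocabulary):

tokens = tokenizer.tokenize("I love programming")
print(tokens)  # e.g. ['i', 'love', 'programming']
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)  # the integer ids the model expects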

  • The default checkpoint for the sentiment-analysis pipeline is distilbert-base-uncased-finetuned-sst-2-english.

  • 🤗 Transformers can be used without concern for the underlying ML framework (PyTorch, TensorFlow, or Flax).

  • Transformer models require tensors as input.

  • Tensors are similar to NumPy arrays, which can have:

    • 0D (scalar),
    • 1D (vector),
    • 2D (matrix),
    • or more dimensions.
  • Other ML frameworks’ tensors behave similarly to NumPy arrays and are easy to instantiate.

raw_inputs = [
    "I hate teaching",
    "I love programming",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")  # "pt" returns PyTorch tensors
inputs
## {'input_ids': tensor([[ 101, 1045, 5223, 4252,  102],
##         [ 101, 1045, 2293, 4730,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1],
##         [1, 1, 1, 1, 1]])}
  • Output Structure:
    • A dictionary containing two keys: input_ids and attention_mask.
    • input_ids: Two rows of integers, one for each sentence, representing the unique identifiers of the tokens.
    • attention_mask: A tensor with the same shape as the input_ids, filled with 0s and 1s, where:
      • 1s indicate tokens to be attended to.
      • 0s indicate tokens to be ignored by the model’s attention layers.
  • Padding:
    • Padding ensures all sequences in a batch match the length of the longest sequence by adding a special padding token (e.g., [PAD]).
    • Padding is important for:
      • Consistency in sequence length across a batch, ensuring efficient processing.
      • Avoiding model bias caused by sequence length variations, which could affect performance.
  • Example:
    • Sentences:
      • “I love NLP.”
      • “Padding in tokenizers is useful.”
    • Maximum sequence length = 5 tokens.
    • The shorter sentence is padded:
      • “I love NLP [PAD] [PAD]” (see the code sketch after this list)
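A minimal sketch of this behaviour, reusing the tokenizer loaded above (a real tokenizer also adds special tokens such as [CLS] and [SEP], so the exact lengths differ from the toy example):

batch = tokenizer(
    ["I love NLP.", "Padding in tokenizers is useful."],
    padding=True,
    return_tensors="pt",
)
print(batch["attention_mask"])  # the shorter sentence's row ends in 0s (padding positions to ignore)
print(tokenizer.convert_ids_to_tokens(batch["input_ids"][0].tolist()))  # trailing [PAD] tokens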

 

7.2 Tokenizers

 

7.2.1 Word-based Tokenizer

tokenized_text = "tokenize the text into words by applying Python’s split() function".split()
tokenized_text
## ['tokenize', 'the', 'text', 'into', 'words', 'by', 'applying', 'Python’s', 'split()', 'function']
  • Simple and easy to set up with few rules.

  • Yields decent results for many applications.

  • Goal:

    • Split raw text into words.
    • Find a numerical representation for each word.
  • Text Splitting Methods:

    • Can split text in different ways.
    • Example: Using whitespace to tokenize text into words.
    • Python’s split() function can be used for this purpose.
  • Word tokenizers can include extra rules for punctuation.

  • These tokenizers create vocabularies, defined by the total number of independent tokens in the corpus.

  • Each word in the corpus is assigned a unique ID, starting from 0, which the model uses to identify each word.

  • A comprehensive word-based tokenizer needs an identifier for every word in a language, leading to a large number of tokens.

    • Example: The English language has over 500,000 words, meaning a vast vocabulary and many unique IDs.
    • Words like “dog” and “dogs” or “run” and “running” are seen as unrelated by the model initially, as there’s no inherent recognition of similarity.
  • Tokenizers include an “unknown” token (often [UNK] or <unk>) for words not in the vocabulary.

  • If many unknown tokens are produced, it indicates that the tokenizer is struggling to represent words accurately and is losing information (a toy vocabulary with an [UNK] fallback is sketched below).
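A toy illustration of a word-level vocabulary with an unknown-token fallback (plain Python, not the library's implementation):

# Build a toy word-level vocabulary: one id per distinct word
corpus = ["the dog runs", "the dogs run fast"]
vocab = {}
for sentence in corpus:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))
vocab["[UNK]"] = len(vocab)  # fallback id for out-of-vocabulary words

# "dog" and "dogs" get unrelated ids; unseen words ("chases", "cat") map to [UNK]
ids = [vocab.get(word, vocab["[UNK]"]) for word in "the dog chases the cat".split()]
print(vocab, ids)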

 

7.2.2 Character-based

Character-based tokenization splits text into characters rather than words.

  • Primary benefits:
    • Smaller vocabulary.
    • Fewer out-of-vocabulary (unknown) tokens, as every word can be constructed from characters.
  • Challenges:
    • Handling spaces and punctuation can raise issues.
  • Drawbacks:
    • Representations may be less meaningful, since individual characters carry less information than words (this varies across languages: Chinese characters hold more information than characters in Latin-script languages, for example).
    • It produces a larger number of tokens to process: a word that is a single token in word-based tokenization may become 10+ tokens in character-based tokenization (see the sketch below).
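A quick illustration of the token-count blow-up (plain Python):

sentence = "Character tokenization produces many tokens"
word_tokens = sentence.split()  # a handful of word-level tokens
char_tokens = list(sentence)    # one token per character, including spaces
print(len(word_tokens), len(char_tokens))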

 

7.2.3 Subword tokenization

Subword tokenization offers a compromise, combining word-based and character-based approaches.

  • Subword tokenization algorithms are based on the idea that:
    • Frequently used words should not be split.
    • Rare words should be decomposed into meaningful subwords.
  • Example:
    • “Annoyingly” could be split into “annoying” and “ly”.
    • These subwords are more common and retain the original meaning.
    • Similarly, “Let's do tokenization!” can be tokenized as Let's</w> do</w> token ization</w> !</w>, where “tokenization” is decomposed into two subwords (a real subword tokenizer is sketched after this list).
  • Benefits of subword tokenization:
    • Semantic meaning is preserved through subword combinations.
    • Efficient representation of long words using fewer tokens.
    • Achieves good vocabulary coverage.
    • Minimizes unknown tokens.
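A minimal sketch of a subword (WordPiece) tokenizer in action, assuming the bert-base-uncased checkpoint:

from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert_tokenizer.tokenize("Let's do tokenization!"))
# e.g. ['let', "'", 's', 'do', 'token', '##ization', '!'] (the rare word is split into subwords)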

 

7.2.4 Loading and saving

tokenizer.save_pretrained("/Users/peltouz/Documents/pretrain/test")
## ('/Users/peltouz/Documents/pretrain/test/tokenizer_config.json', '/Users/peltouz/Documents/pretrain/test/special_tokens_map.json', '/Users/peltouz/Documents/pretrain/test/vocab.txt', '/Users/peltouz/Documents/pretrain/test/added_tokens.json', '/Users/peltouz/Documents/pretrain/test/tokenizer.json')
  • Loading and saving tokenizers is similar to handling models.

  • It uses the same two methods: from_pretrained() and save_pretrained().

  • These methods load or save:

    • The algorithm used by the tokenizer (comparable to the architecture of a model).
    • The vocabulary (comparable to the weights of a model).
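The saved tokenizer can be reloaded from the same local directory with from_pretrained():

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/Users/peltouz/Documents/pretrain/test")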

 

7.3 Going through the model

from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)


outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
## torch.Size([2, 5, 768])
  • The outputs of 🤗 Transformers models resemble namedtuples or dictionaries.

  • You can access elements in different ways:

    • By attributes (e.g., outputs.last_hidden_state)
    • By key (e.g., outputs["last_hidden_state"])
    • By index, if you know the exact position (e.g., outputs[0])

 

7.4 Model heads: Making sense out of numbers

  • Model heads take the high-dimensional vectors of hidden states as input.

  • They project these onto a different dimension.

  • They are usually composed of one or a few linear layers.

  • Transformers and Model Heads:

    • The output of the Transformer model is sent directly to the model head for further processing.
    • The embeddings layer in the model converts input IDs into vectors representing the associated tokens.
    • Subsequent layers manipulate these vectors using the attention mechanism to produce final sentence representations.
  • Available Architectures in 🤗 Transformers:

    • Model (retrieves hidden states)
    • ForCausalLM
    • ForMaskedLM
    • ForMultipleChoice
    • ForQuestionAnswering
    • ForSequenceClassification
    • ForTokenClassification
  • For example, for sentence classification tasks, use AutoModelForSequenceClassification instead of AutoModel.

from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
  • The model head takes the high-dimensional hidden-state vectors as input and projects them onto a lower dimension.

  • It outputs vectors containing two values, one for each label.

  • Since there are two sentences and two labels, the resulting output has a shape of 2 x 2.

print(outputs.logits.shape)
## torch.Size([2, 2])
print(outputs.logits)
## tensor([[ 3.7257, -3.0865],
##         [-4.1563,  4.4512]], grad_fn=<AddmmBackward0>)
  • The values referred to are logits, not probabilities.

  • Logits are the raw, unnormalized scores output by the model’s last layer.

  • To convert logits to probabilities, they must pass through a SoftMax layer.

    • SoftMax is a generalization of the logistic function to multiple dimensions.
    • It is used in multinomial logistic regression.

During training, the final activation function (e.g., SoftMax) is typically fused with the loss function (e.g., cross-entropy), which is why the model itself outputs logits rather than probabilities.

import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
## tensor([[9.9890e-01, 1.0991e-03],
##         [1.8271e-04, 9.9982e-01]], grad_fn=<SoftmaxBackward0>)
  • These are recognizable probability scores.

  • To get the labels corresponding to each position, we can inspect the id2label attribute of the model config (more on this in the next section):


model.config.id2label
## {0: 'NEGATIVE', 1: 'POSITIVE'}

 

8 Models

  • AutoModel Class:
    • Designed to instantiate any model from a checkpoint.
    • Functions as a wrapper for the various models in the library.
    • Automatically guesses the appropriate model architecture for the checkpoint.
    • Instantiates a model with the guessed architecture.
  • Direct Model Class Usage:
    • If the model type is known, the specific class defining the architecture can be used directly (e.g., BERT model).

 

8.1 Creating a Transformer

The first thing we’ll need to do to initialize a BERT model is load a configuration object:

from transformers import BertConfig, BertModel

config = BertConfig()
model = BertModel(config)

print(config)
## BertConfig {
##   "attention_probs_dropout_prob": 0.1,
##   "classifier_dropout": null,
##   "hidden_act": "gelu",
##   "hidden_dropout_prob": 0.1,
##   "hidden_size": 768,
##   "initializer_range": 0.02,
##   "intermediate_size": 3072,
##   "layer_norm_eps": 1e-12,
##   "max_position_embeddings": 512,
##   "model_type": "bert",
##   "num_attention_heads": 12,
##   "num_hidden_layers": 12,
##   "pad_token_id": 0,
##   "position_embedding_type": "absolute",
##   "transformers_version": "4.57.1",
##   "type_vocab_size": 2,
##   "use_cache": true,
##   "vocab_size": 30522
## }
  • The model can be used in its current state but will output gibberish.

  • The model requires training before it can perform well.

  • Training the model from scratch would:

    • Take a long time.
    • Require a lot of data.
    • Have a non-negligible environmental impact.
  • Instead, reusing pre-trained models can avoid unnecessary and duplicated efforts.

  • Pre-trained Transformer models can be easily loaded using the from_pretrained() method.

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")
  • Using AutoModel:
    • Replace BertModel with AutoModel to produce checkpoint-agnostic code.
    • This ensures compatibility with different checkpoints trained for similar tasks, even if the architecture varies (see the sketch after this list).
  • Loading Pretrained Models:
    • In the example, BertConfig was not used; instead, a pretrained model (bert-base-cased) was loaded.
    • This specific checkpoint was trained by the original BERT authors, with details available in the model card.
  • Model Initialization and Usage:
    • The model is initialized with pretrained weights from the checkpoint.
    • It can be used directly for inference or fine-tuned on new tasks.
    • Using pretrained weights helps achieve good results faster than training from scratch.
  • Caching and Customizing Cache Folder:
    • Weights are downloaded and cached in the default folder ~/.cache/huggingface/hub.
    • Cache folder location can be customized by setting the HF_HOME environment variable.
  • Model Identifiers:
    • The identifier used to load the model can be from any compatible model on the Model Hub.
    • A full list of available BERT checkpoints can be found on the Hugging Face Model Hub.
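A minimal sketch of the checkpoint-agnostic variant mentioned above:

from transformers import AutoModel

# AutoModel infers the architecture (here BERT) from the checkpoint's configuration
model = AutoModel.from_pretrained("bert-base-cased")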

 

8.2 Saving methods

  • Use the save_pretrained() method to save a model.
model.save_pretrained("/Users/peltouz/Documents/pretrain/test")
  • config.json file:
    • Contains attributes needed to build the model architecture.
    • Includes metadata:
      • Checkpoint origin.
      • 🤗 Transformers version used when the checkpoint was last saved.
  • pytorch_model.bin file (model.safetensors in recent versions of 🤗 Transformers):
    • Known as the state dictionary.
    • Contains all the model’s weights (parameters).
  • Relationship:
    • The config.json file provides the model architecture.
    • The weights file holds the model’s parameters (weights).
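A minimal sketch of reloading the saved model from that local directory (the same path as above):

from transformers import AutoModel

# from_pretrained() also accepts a local directory containing the config and weights files
model = AutoModel.from_pretrained("/Users/peltouz/Documents/pretrain/test")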

 

9 Wrapping up: From tokenizer to model


import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
predictions = torch.nn.functional.softmax(output.logits, dim=-1)
print(predictions)
## tensor([[4.0195e-02, 9.5981e-01],
##         [5.3534e-04, 9.9946e-01]], grad_fn=<SoftmaxBackward0>)
model.config.id2label
## {0: 'NEGATIVE', 1: 'POSITIVE'}
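To turn these probabilities into readable labels, take the argmax of each row and look it up in id2label; a minimal sketch:

# Map each prediction to its label via the model configuration
predicted_ids = predictions.argmax(dim=-1)
print([model.config.id2label[idx.item()] for idx in predicted_ids])
# ['POSITIVE', 'POSITIVE'] for the two sequences above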