1 Transformer models

 

1.1 For which tasks?

  • Classifying whole sentences:
    • Examples: Sentiment analysis, spam detection, grammatical correctness, sentence relationship.
  • Classifying each word in a sentence:
    • Examples: Grammatical components (noun, verb, adjective), named entity recognition (person, location, organization).
  • Generating text content:
    • Examples: Autocomplete text from a prompt, fill in masked words in a text.
  • Extracting an answer from a text:
    • Examples: Given a question and context, extract the answer based on the context.
  • Generating a new sentence from an input text:
    • Examples: Translation, text summarization.

 

1.1.1 A challenging task

  • Human vs Machine Language Processing:
    • Humans understand sentences like “I am hungry” easily, and can compare similar sentences like “I am hungry” and “I am sad”.
    • Machine learning models find it more difficult to process and understand language, requiring careful text processing.
  • Introduction of Transformers:
    • 2017: Introduction of Transformers by Vaswani et al. in the paper “Attention Is All You Need.”
      • Neural network architecture learning context and relationships from sequential data, initially focused on translation tasks.
  • Influential Transformer Models:
    • June 2018: GPT - First pretrained transformer model, state-of-the-art results on various NLP tasks.
    • October 2018: BERT - Designed to produce better sentence summaries.
    • February 2019: GPT-2 - Improved version of GPT, not immediately released due to ethical concerns.
    • October 2019: DistilBERT - A distilled version of BERT, 60% faster, 40% lighter, retaining 97% of BERT’s performance.
    • October 2019: BART and T5 - Pretrained models using the original Transformer architecture.
    • May 2020: GPT-3 - Larger model, capable of zero-shot learning.
    • November 2022: GPT-3.5 and March 2023: GPT-4.
    • Followed by other models like Llama, Claude, Gemini, Mistral.

 

2 Pretraining and Fine-tuning

  • Transformer models (GPT, BERT, BART, T5, etc.) are trained as language models on large amounts of raw text using self-supervised learning.
    • Self-supervised learning: the model learns without human-labeled data, as the objective is computed automatically from the inputs.
    • These models develop a statistical understanding of language but aren’t directly useful for specific tasks.
  • Transfer learning: the process where a pretrained model is fine-tuned on specific tasks using supervised learning (human-annotated labels).
    • Pretrained models undergo fine-tuning to adapt to particular tasks.
  • Pretraining:
    • The model is trained from scratch with randomly initialized weights.
    • Pretraining is resource-intensive in terms of time, data, and money.
    • It requires a large corpus and can take weeks to complete.
  • Fine-tuning:
    • Performed after pretraining using a dataset specific to a task.
    • Reasons to fine-tune instead of training from scratch:
      • The pretrained model was already trained on data that shares similarities with the fine-tuning dataset.
      • Less data and time are required to achieve good results.
      • Lower costs in terms of time, data, financial, and environmental resources.
      • It allows for faster iteration and refinement.
  • Transfer learning advantages:
    • Leverages knowledge from pretraining for improved task-specific performance.
    • Fine-tuning achieves better results than training from scratch unless massive data is available.
    • Using pretrained models close to the target task optimizes performance.

 

3 Transformers Library

  • The 🤗 Transformers library provides tools to create and use shared models.

  • Model Hub:

    • Contains thousands of pretrained models for download and use.
    • Users can also upload their own models to the Hub.
  • Pipeline() function:

    • The most basic object in the library.
    • Connects a model with necessary preprocessing and postprocessing steps.
    • Allows direct input of text for intelligible output.
  • Three main steps in a pipeline:

    1. Text is preprocessed into a format the model understands.
    2. Preprocessed inputs are passed to the model.
    3. Model predictions are post-processed for easy interpretation.
  • Available pipelines:

    • feature-extraction (vector representation of a text)
    • fill-mask
    • ner (named entity recognition)
    • question-answering
    • sentiment-analysis
    • summarization
    • text-generation
    • translation
    • zero-shot-classification
import os
hf_token = "your-key"
custom_cache_dir = '/Users/peltouz/Documents/pretrain'

os.environ['HF_HOME'] = custom_cache_dir  # Hugging Face home directory for all HF operations
os.environ['TRANSFORMERS_CACHE'] = custom_cache_dir  # Transformers-specific cache directory
os.environ['HF_DATASETS_CACHE'] = custom_cache_dir  # Datasets-specific cache directory
os.environ['HF_METRICS_CACHE'] = custom_cache_dir  # Metrics-specific cache directory
os.environ['HF_TOKEN'] = hf_token  # Hugging Face API token

This Python code snippet configures the environment to specify custom cache directories for operations involving Hugging Face libraries, and it sets an API token for authentication.

  • API Token Configuration:
    • hf_token: Assigned to a string representing the Hugging Face API token.
    • Purpose: Used for authenticating and accessing Hugging Face services that require credentials.
  • Custom Cache Directory:
    • custom_cache_dir: Set to /Users/peltouz/Documents/pretrain in this example.
    • Purpose: Specifies a base directory for storing cache files related to Hugging Face operations.
  • Environment Variables:
    • HF_HOME: Specifies the base directory for all Hugging Face-related operations.
    • TRANSFORMERS_CACHE: Directory for caching models downloaded via the transformers library.
    • HF_DATASETS_CACHE: Directory for caching datasets accessed via the datasets library.
    • HF_METRICS_CACHE: Directory for caching metrics-related files used in model evaluation.
    • HF_TOKEN: Environment variable for the Hugging Face API token to authenticate requests to Hugging Face services.
  • Purpose of Environment Variables:
    • Ensure all data related to models, datasets, and metrics are stored in the specified directory (/Users/peltouz/Documents/pretrain).
    • Help manage disk space, especially when handling large models or datasets.
    • Correctly configure credentials for accessing restricted services.

 

4 The Pipeline function

 

4.1 Sentiment Analysis

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
sentiments = classifier(
        ["I hate teaching",
         "I love programming"]
    )

print(sentiments)
## [{'label': 'NEGATIVE', 'score': 0.9989008903503418}, {'label': 'POSITIVE', 'score': 0.9998173117637634}]
sentiments[0]['label']
## 'NEGATIVE'
  • A pretrained model fine-tuned for sentiment analysis in English is selected by default.
  • The model is downloaded and cached when the classifier object is created.
  • Upon rerunning the command, the cached model is used without needing to download again.

 

4.2 Zero-shot classification

classifier = pipeline("zero-shot-classification")
classifier(
    "df %>% filter(!is.na(var1))",
    candidate_labels=["python", "Rstudio"],
)
## {'sequence': 'df %>% filter(!is.na(var1))', 'labels': ['Rstudio', 'python'], 'scores': [0.6543467044830322, 0.345653235912323]}
  • Main idea:
    • Classifying unlabelled texts is a challenging task often encountered in real-world projects.
  • Comparison:
    • Annotating text manually is time-consuming and requires domain expertise.
  • Key aspect:
    • The zero-shot-classification pipeline is powerful because it lets you specify your own labels for classification, instead of relying on pretrained model labels (see the sketch below).
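A minimal sketch reusing the classifier created above: the label set is entirely up to you, and the optional multi_label argument scores each candidate label independently (the example sentence and labels here are illustrative).

classifier(
    "The model overfits when the learning rate is too high.",
    candidate_labels=["machine learning", "cooking", "sports"],
    multi_label=True,  # score each label on its own instead of normalizing across labels
)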

 

4.3 Text generation

generator = pipeline("text-generation")
generator("In my programming course in DS2E I will")
## [{'generated_text': "In my programming course in DS2E I will be presenting you with a simple and elegant way to test this idea.\n\nI have already covered a few examples here, so I will present you with a simple test to show you what it does.\n\nLet's start with the idea that you can put this test in your program and run it.\n\nclass SimpleTest(object): def __init__(self, tests): self.tests.add_function() self.tests.add_function() self.tests.execute()\n\nThe test code will return a function.\n\nIn my programming course in DS2E this function will return a function.\n\nclass TestFunction(object): def __init__(self, tests): self.tests.add_function() self.tests.add_function()\n\nThe test code will return a function.\n\nIn my programming course in DS2E it will return a function.\n\nclass SimpleTestProcedure(object): def __init__(self, tests): self.tests.add_function() self.tests.add_function()\n\nThe test code will return a function.\n\nclass SimpleTestFunctionTest(object): def __init__(self, tests): self.tests."}]
  • Main idea:
    • A prompt is provided, and the model auto-completes it by generating the remaining text.
  • Comparison:
    • Similar to the predictive text feature on phones.
  • Key aspect:
    • Text generation involves randomness, so results may vary between runs (see the sketch below).
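Because generation samples from the model's output distribution, you can fix the random seed if you want reproducible runs; a minimal sketch using the set_seed helper from 🤗 Transformers (the prompt and token budget are just examples):

from transformers import set_seed

set_seed(42)  # fix the sampling seed so repeated runs give the same continuation
generator("In my programming course in DS2E I will", max_new_tokens=20)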

 

4.4 Mask filling

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)
## [{'score': 0.19620011746883392, 'token': 30412, 'token_str': ' mathematical', 'sequence': 'This course will teach you all about mathematical models.'}, {'score': 0.04052743315696716, 'token': 38163, 'token_str': ' computational', 'sequence': 'This course will teach you all about computational models.'}]
  • Main idea:
    • The task involves filling in the blanks in a given text.
  • Comparison:
    • Similar to completing sentences in cloze tests or gap-filling exercises.
  • Key aspect:
    • The focus is on providing contextually appropriate words or phrases to complete the text.

 

4.5 Named entity recognition

ner = pipeline("ner", aggregation_strategy="simple")  # group tokens belonging to the same entity
ner("My name is Pierre and I work at BETA in Strasbourg.")
## [{'entity_group': 'PER', 'score': np.float32(0.99918455), 'word': 'Pierre', 'start': 11, 'end': 17}, {'entity_group': 'ORG', 'score': np.float32(0.9977419), 'word': 'BETA', 'start': 32, 'end': 36}, {'entity_group': 'LOC', 'score': np.float32(0.9894748), 'word': 'Strasbourg', 'start': 40, 'end': 50}]
  • Main idea:
    • Named entity recognition (NER) involves identifying parts of the input text that correspond to entities.
  • Key aspect:
    • The entities of interest are typically persons, locations, or organizations.

 

4.6 Question answering

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Pierre and I work at BETA in Strasbourg.",
)
## {'score': 0.5033916071697604, 'start': 32, 'end': 36, 'answer': 'BETA'}

The question-answering pipeline extracts the answer from the provided context, as shown in the example above.

 

4.7 Summarization

summarizer = pipeline("summarization")
summarizer(
    """This paper offers insights into the diffusion and impact of artificial intelligence in science.
More specifically, we show that neural network-based technology meets the essential properties of emerging technologies in the scientific realm.
It is novel, because it shows discontinuous innovations in the originating domain and is put to new uses in many application domains;
it is quick growing, its dimensions being subject to rapid change; it is coherent, because it detaches from its technological parents, and integrates and is accepted in different scientific communities;
and it has a prominent impact on scientific discovery, but a high degree of uncertainty and ambiguity associated with this impact.
Our findings suggest that intelligent machines diffuse in the sciences, reshape the nature of the discovery process and affect the organization of science.
We propose a new conceptual framework that considers artificial intelligence as an emerging general method of invention and, on this basis, derive its policy implications."""
)
## [{'summary_text': ' Neural network-based technology meets the essential properties of emerging technologies in the scientific realm . It is novel, because it shows discontinuous innovations in the originating domain and is put to new uses in many application domains . Researchers propose a new conceptual framework that considers artificial intelligence as an emerging general method of invention .'}]

Summarization is the task of reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text, as in the example above.

 

5 Using a specific model

See this Warning message: “Using a pipeline without specifying a model name and revision in production is not recommended.”

  • Recommendation:
    • Specify the model name and revision when using pipelines in production environments.
  • Actionable step:
    • Choose a specific model from the 1M+ models available on Hugging Face.
  • Resource:
    • Browse the available models on the Hugging Face Model Hub (https://huggingface.co/models).
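As the warning suggests, you can pin both the model name and a revision (a tag or commit hash from the model repository); a minimal, hedged sketch:

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    revision="main",  # replace with a specific commit hash or tag for full reproducibility
)

The examples below pin only the model name: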
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In my programming course in DS2E I will",
    max_length=30,
    num_return_sequences=3,
)
## [{'generated_text': "In my programming course in DS2E I will make a series of useful programming concepts, including the main object class, the class and the class itself.\n\nHere's a list of more useful programming concepts:\n1. SetInterpolation()\n2. SetInterpolation()\n3. SetInterpolation()\n4. SetInterpolation()\n5. SetInterpolation()\n6. SetInterpolation()\n7. SetInterpolation()\n8. SetInterpolation()\n9. SetInterpolation()\n10. SetInterpolation()\n11. SetInterpolation()\n12. SetInterpolation()\n13. SetInterpolation()\n14. SetInterpolation()\n2015. SetInterpolation()\n2015. SetInterpolation()\n2016. SetInterpolation()\n2017. SetInterpolation()\n2018. SetInterpolation()\n2019. SetInterpolation()\n20. SetInterpolation()\n21. SetInterpolation()\n22. SetInterpolation()\n2015. SetInterpolation()\n2016. SetInterpolation()\n2017. SetInterpolation()\n2018. SetInterpolation()\n2019. SetInterpolation"}, {'generated_text': 'In my programming course in DS2E I will also be using the DSP and the DSP. This course is designed to help with DS2E and DS2E: The DSP and DSP.\n\n\n\nIn this course I will also be using the DSP and DSP.\nIn my course I will also be using the DSP and DSP.'}, {'generated_text': "In my programming course in DS2E I will have to wait for the end of his semester in order to work on a new game, so I'll be able to take a few weeks off, so it feels like I've completed everything I've been doing.\n\nI will be working on an upcoming game, Agrins, that I can play on the DS2E. This is my first game with the DS2E. I'll be using the game to play on the DS2E. This is my first game with the DS2E. I will be using the game to play on the DS2E. This is my first game with the DS2E. I will be using the game to play on the DS2E. This is my first game with the DS2E. I will be using the game to play on the DS2E. This is my first game with the DS2E. I will be using the game to play on the DS2E. This is my first game with the DS2E. I will be using the game to play on the DS2E.\nI will be using the game to play on the DS2E. This is my first game with the DS2E. I will be using the game to play on the DS2"}]
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")
## [{'translation_text': 'This course is produced by Hugging Face.'}]

For translation, you can use a default model if you provide a language pair in the task name (such as “translation_en_to_fr”), but the easiest way is to pick the model you want to use on the Model Hub.

Here we translated from French to English using the Helsinki-NLP/opus-mt-fr-en model.

 

6 Bias and limitations

  • Pretrained and fine-tuned models are powerful tools, but they have limitations.

  • The main limitation stems from the nature of pretraining on large datasets.

    • Data is often scraped indiscriminately from the internet.
    • This includes both high-quality and low-quality content.
  • An example is provided with a fill-mask pipeline using the BERT model.

unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])
## ['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])
## ['nurse', 'maid', 'teacher', 'waitress', 'prostitute']

 

7 Behind the pipeline function

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
sentiments = classifier(
        ["I hate teaching",
         "I love programming"]
    )
sentiments
## [{'label': 'NEGATIVE', 'score': 0.9989008903503418}, {'label': 'POSITIVE', 'score': 0.9998173117637634}]

This pipeline groups together three steps: preprocessing, passing the inputs through the model, and postprocessing.

 

7.1 Preprocessing with a tokenizer

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
  • Transformer models can’t process raw text directly.

  • The first step is to convert text inputs into numbers using a tokenizer.

  • The tokenizer is responsible for:

    • Splitting the input into tokens (words, subwords, or symbols like punctuation).
    • Mapping each token to an integer (both steps are sketched after this list).
    • Adding additional inputs useful to the model.
  • Preprocessing must be consistent with how the model was pretrained.

  • AutoTokenizer class and its from_pretrained() method help fetch and cache the tokenizer information.
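A minimal sketch of the first two steps, reusing the tokenizer loaded above (the exact tokens depend on the checkpoint's vocabulary):

tokens = tokenizer.tokenize("I love programming")
print(tokens)  # e.g. ['i', 'love', 'programming']
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)  # the integer ids the model expects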

  • The default checkpoint for the sentiment-analysis pipeline is distilbert-base-uncased-finetuned-sst-2-english.

  • 🤗 Transformers can be used without concern for the underlying ML framework (PyTorch, TensorFlow, or Flax).

  • Transformer models require tensors as input.

  • Tensors are similar to NumPy arrays, which can have:

    • 0D (scalar),
    • 1D (vector),
    • 2D (matrix),
    • or more dimensions.
  • Other ML frameworks’ tensors behave similarly to NumPy arrays and are easy to instantiate.

raw_inputs = [
    "I hate teaching",
    "I love programming",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")  # "pt" returns PyTorch tensors
inputs
## {'input_ids': tensor([[ 101, 1045, 5223, 4252,  102],
##         [ 101, 1045, 2293, 4730,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1],
##         [1, 1, 1, 1, 1]])}
  • Output Structure:
    • A dictionary containing two keys: input_ids and attention_mask.
    • input_ids: Two rows of integers, one for each sentence, representing the unique identifiers of the tokens.
    • attention_mask: A tensor with the same shape as the input_ids, filled with 0s and 1s, where:
      • 1s indicate tokens to be attended to.
      • 0s indicate tokens to be ignored by the model’s attention layers.
  • Padding:
    • Padding ensures all sequences in a batch match the length of the longest sequence by adding a special padding token (e.g., [PAD]).
    • Padding is important for:
      • Consistency in sequence length across a batch, ensuring efficient processing.
      • Avoiding model bias caused by sequence length variations, which could affect performance.
  • Example:
    • Sentences:
      • “I love NLP.”
      • “Padding in tokenizers is useful.”
    • Maximum sequence length = 5 tokens.
    • The shorter sentence is padded:
      • “I love NLP [PAD] [PAD]” (see the code sketch after this list)
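A minimal sketch of this behaviour, reusing the tokenizer loaded above (a real tokenizer also adds special tokens such as [CLS] and [SEP], so the exact lengths differ from the toy example):

batch = tokenizer(
    ["I love NLP.", "Padding in tokenizers is useful."],
    padding=True,
    return_tensors="pt",
)
print(batch["attention_mask"])  # the shorter sentence's row ends in 0s (padding positions to ignore)
print(tokenizer.convert_ids_to_tokens(batch["input_ids"][0].tolist()))  # trailing [PAD] tokens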

 

7.2 Tokenizers

 

7.2.1 Word-based Tokenizer

tokenized_text = "tokenize the text into words by applying Python’s split() function".split()
tokenized_text
## ['tokenize', 'the', 'text', 'into', 'words', 'by', 'applying', 'Python’s', 'split()', 'function']
  • Simple and easy to set up with few rules.

  • Yields decent results for many applications.

  • Goal:

    • Split raw text into words.
    • Find a numerical representation for each word.
  • Text Splitting Methods:

    • Can split text in different ways.
    • Example: Using whitespace to tokenize text into words.
    • Python’s split() function can be used for this purpose.
  • Word tokenizers can include extra rules for punctuation.

  • These tokenizers create vocabularies, defined by the total number of independent tokens in the corpus.

  • Each word in the corpus is assigned a unique ID, starting from 0, which the model uses to identify each word.

  • A comprehensive word-based tokenizer needs an identifier for every word in a language, leading to a large number of tokens.

    • Example: The English language has over 500,000 words, meaning a vast vocabulary and many unique IDs.
    • Words like “dog” and “dogs” or “run” and “running” are seen as unrelated by the model initially, as there’s no inherent recognition of similarity.
  • Tokenizers include an “unknown” token (often [UNK] or <unk>) for words not in the vocabulary.

  • If many unknown tokens are produced, it indicates that the tokenizer is struggling to represent words accurately and is losing information (a toy vocabulary with an [UNK] fallback is sketched below).
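A toy illustration of a word-level vocabulary with an unknown-token fallback (plain Python, not the library's implementation):

# Build a toy word-level vocabulary: one id per distinct word
corpus = ["the dog runs", "the dogs run fast"]
vocab = {}
for sentence in corpus:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))
vocab["[UNK]"] = len(vocab)  # fallback id for out-of-vocabulary words

# "dog" and "dogs" get unrelated ids; unseen words ("chases", "cat") map to [UNK]
ids = [vocab.get(word, vocab["[UNK]"]) for word in "the dog chases the cat".split()]
print(vocab, ids)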

 

7.2.2 Character-based

Character-based tokenization splits text into characters rather than words.

  • Primary benefits:
    • Smaller vocabulary.
    • Fewer out-of-vocabulary (unknown) tokens, as every word can be constructed from characters.
  • Challenges:
    • Handling spaces and punctuation can raise issues.
  • Drawbacks:
    • Representations may be less meaningful, since individual characters carry less information than words (this varies across languages: Chinese characters hold more information than characters in Latin-script languages, for example).
    • It produces a larger number of tokens to process: a word that is a single token in word-based tokenization may become 10+ tokens in character-based tokenization (see the sketch below).
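A quick illustration of the token-count blow-up (plain Python):

sentence = "Character tokenization produces many tokens"
word_tokens = sentence.split()  # a handful of word-level tokens
char_tokens = list(sentence)    # one token per character, including spaces
print(len(word_tokens), len(char_tokens))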

 

7.2.3 Subword tokenization

Subword tokenization offers a compromise, combining word-based and character-based approaches.

  • Subword tokenization algorithms are based on the idea that:
    • Frequently used words should not be split.
    • Rare words should be decomposed into meaningful subwords.
  • Example:
    • “Annoyingly” could be split into “annoying” and “ly”.
    • These subwords are more common and retain the original meaning.
    • Similarly, “Let's do tokenization!” can be tokenized as Let's</w> do</w> token ization</w> !</w>, where “tokenization” is decomposed into two subwords (a real subword tokenizer is sketched after this list).
  • Benefits of subword tokenization:
    • Semantic meaning is preserved through subword combinations.
    • Efficient representation of long words using fewer tokens.
    • Achieves good vocabulary coverage.
    • Minimizes unknown tokens.
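A minimal sketch of a subword (WordPiece) tokenizer in action, assuming the bert-base-uncased checkpoint:

from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(bert_tokenizer.tokenize("Let's do tokenization!"))
# e.g. ['let', "'", 's', 'do', 'token', '##ization', '!'] (the rare word is split into subwords)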

 

7.2.4 Loading and saving

tokenizer.save_pretrained("/Users/peltouz/Documents/pretrain/test")
## ('/Users/peltouz/Documents/pretrain/test/tokenizer_config.json', '/Users/peltouz/Documents/pretrain/test/special_tokens_map.json', '/Users/peltouz/Documents/pretrain/test/vocab.txt', '/Users/peltouz/Documents/pretrain/test/added_tokens.json', '/Users/peltouz/Documents/pretrain/test/tokenizer.json')
  • Loading and saving tokenizers is similar to handling models.

  • It uses the same two methods: from_pretrained() and save_pretrained().

  • These methods load or save:

    • The algorithm used by the tokenizer (comparable to the architecture of a model).
    • The vocabulary (comparable to the weights of a model).
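The saved tokenizer can be reloaded from the same local directory with from_pretrained():

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/Users/peltouz/Documents/pretrain/test")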

 

7.3 Going through the model

from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)


outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
## torch.Size([2, 5, 768])
  • The outputs of 🤗 Transformers models resemble namedtuples or dictionaries.

  • You can access elements in different ways:

    • By attributes (e.g., outputs.last_hidden_state)
    • By key (e.g., outputs["last_hidden_state"])
    • By index, if you know the exact position (e.g., outputs[0])

 

7.4 Model heads: Making sense out of numbers

  • Model heads take the high-dimensional vectors of hidden states as input.

  • They project these onto a different dimension.

  • They are usually composed of one or a few linear layers.

  • Transformers and Model Heads:

    • The output of the Transformer model is sent directly to the model head for further processing.
    • The embeddings layer in the model converts input IDs into vectors representing the associated tokens.
    • Subsequent layers manipulate these vectors using the attention mechanism to produce final sentence representations.
  • Available Architectures in 🤗 Transformers:

    • Model (retrieves hidden states)
    • ForCausalLM
    • ForMaskedLM
    • ForMultipleChoice
    • ForQuestionAnswering
    • ForSequenceClassification
    • ForTokenClassification
  • For example, for sentence classification tasks, use AutoModelForSequenceClassification instead of AutoModel.

from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
  • The model head takes the high-dimensional hidden-state vectors as input and projects them onto a lower dimension.

  • It outputs vectors containing two values, one for each label.

  • Since there are two sentences and two labels, the resulting output has a shape of 2 x 2.

print(outputs.logits.shape)
## torch.Size([2, 2])
print(outputs.logits)
## tensor([[ 3.7257, -3.0865],
##         [-4.1563,  4.4512]], grad_fn=<AddmmBackward0>)
  • The values referred to are logits, not probabilities.

  • Logits are the raw, unnormalized scores output by the model’s last layer.

  • To convert logits to probabilities, they must pass through a SoftMax layer.

    • SoftMax is a generalization of the logistic function to multiple dimensions.
    • It is used in multinomial logistic regression.

During training, the final activation function (e.g., SoftMax) is typically fused with the loss function (e.g., cross-entropy), which is why the model itself outputs logits rather than probabilities.

import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
## tensor([[9.9890e-01, 1.0991e-03],
##         [1.8271e-04, 9.9982e-01]], grad_fn=<SoftmaxBackward0>)
  • These are recognizable probability scores.

  • To get the labels corresponding to each position, we can inspect the id2label attribute of the model config (more on this in the next section):


model.config.id2label
## {0: 'NEGATIVE', 1: 'POSITIVE'}

 

8 Models

  • AutoModel Class:
    • Designed to instantiate any model from a checkpoint.
    • Functions as a wrapper for the various models in the library.
    • Automatically guesses the appropriate model architecture for the checkpoint.
    • Instantiates a model with the guessed architecture.
  • Direct Model Class Usage:
    • If the model type is known, the specific class defining the architecture can be used directly (e.g., BERT model).

 

8.1 Creating a Transformer

The first thing we’ll need to do to initialize a BERT model is load a configuration object:

from transformers import BertConfig, BertModel

config = BertConfig()
model = BertModel(config)

print(config)
## BertConfig {
##   "attention_probs_dropout_prob": 0.1,
##   "classifier_dropout": null,
##   "hidden_act": "gelu",
##   "hidden_dropout_prob": 0.1,
##   "hidden_size": 768,
##   "initializer_range": 0.02,
##   "intermediate_size": 3072,
##   "layer_norm_eps": 1e-12,
##   "max_position_embeddings": 512,
##   "model_type": "bert",
##   "num_attention_heads": 12,
##   "num_hidden_layers": 12,
##   "pad_token_id": 0,
##   "position_embedding_type": "absolute",
##   "transformers_version": "4.57.1",
##   "type_vocab_size": 2,
##   "use_cache": true,
##   "vocab_size": 30522
## }
  • The model can be used in its current state but will output gibberish.

  • The model requires training before it can perform well.

  • Training the model from scratch would:

    • Take a long time.
    • Require a lot of data.
    • Have a non-negligible environmental impact.
  • Instead, reusing pre-trained models can avoid unnecessary and duplicated efforts.

  • Pre-trained Transformer models can be easily loaded using the from_pretrained() method.

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")
  • Using AutoModel:
    • Replace BertModel with AutoModel to produce checkpoint-agnostic code.
    • This ensures compatibility with different checkpoints trained for similar tasks, even if the architecture varies (see the sketch after this list).
  • Loading Pretrained Models:
    • In the example, BertConfig was not used; instead, a pretrained model (bert-base-cased) was loaded.
    • This specific checkpoint was trained by the original BERT authors, with details available in the model card.
  • Model Initialization and Usage:
    • The model is initialized with pretrained weights from the checkpoint.
    • It can be used directly for inference or fine-tuned on new tasks.
    • Using pretrained weights helps achieve good results faster than training from scratch.
  • Caching and Customizing Cache Folder:
    • Weights are downloaded and cached in the default folder ~/.cache/huggingface/hub.
    • Cache folder location can be customized by setting the HF_HOME environment variable.
  • Model Identifiers:
    • The identifier used to load the model can be from any compatible model on the Model Hub.
    • A full list of available BERT checkpoints can be found on the Hugging Face Model Hub.
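A minimal sketch of the checkpoint-agnostic variant mentioned above:

from transformers import AutoModel

# AutoModel infers the architecture (here BERT) from the checkpoint's configuration
model = AutoModel.from_pretrained("bert-base-cased")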

 

8.2 Saving methods

  • Use the save_pretrained() method to save a model.
model.save_pretrained("/Users/peltouz/Documents/pretrain/test")
  • config.json file:
    • Contains attributes needed to build the model architecture.
    • Includes metadata:
      • Checkpoint origin.
      • 🤗 Transformers version used when the checkpoint was last saved.
  • pytorch_model.bin file (model.safetensors in recent versions of 🤗 Transformers):
    • Known as the state dictionary.
    • Contains all the model’s weights (parameters).
  • Relationship:
    • The config.json file provides the model architecture.
    • The weights file holds the model’s parameters (weights).
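A minimal sketch of reloading the saved model from that local directory (the same path as above):

from transformers import AutoModel

# from_pretrained() also accepts a local directory containing the config and weights files
model = AutoModel.from_pretrained("/Users/peltouz/Documents/pretrain/test")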

 

9 Wrapping up: From tokenizer to model


import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
predictions = torch.nn.functional.softmax(output.logits, dim=-1)
print(predictions)
## tensor([[4.0195e-02, 9.5981e-01],
##         [5.3534e-04, 9.9946e-01]], grad_fn=<SoftmaxBackward0>)
model.config.id2label
## {0: 'NEGATIVE', 1: 'POSITIVE'}
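To turn these probabilities into readable labels, take the argmax of each row and look it up in id2label; a minimal sketch:

# Map each prediction to its label via the model configuration
predicted_ids = predictions.argmax(dim=-1)
print([model.config.id2label[idx.item()] for idx in predicted_ids])
# ['POSITIVE', 'POSITIVE'] for the two sequences above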