    from embeddings.pipeline.hf_preprocessing_pipeline import HuggingFacePreprocessingPipeline
Two types of config are defined in our library: BasicConfig and AdvancedConfig. BasicConfig allows for easy use of the most common parameters in the pipeline.
LightningBasicConfig
LightningBasicConfig (use_scheduler:bool=True, optimizer:str='Adam', warmup_steps:int=100, learning_rate:float=0.0001, adam_epsilon:float=1e-08, weight_decay:float=0.0, finetune_last_n_layers:int=-1, classifier_dropout:Optional[float]=None, max_seq_length:Optional[int]=None, batch_size:int=32, max_epochs:Optional[int]=None, early_stopping_monitor:str='val/Loss', early_stopping_mode:str='min', early_stopping_patience:int=3)
AdvancedConfig: the objects defined in our pipelines are constructed in a way that they can be further parametrized with keyword arguments; these arguments can be supplied by constructing the AdvancedConfig.
LightningAdvancedConfig
LightningAdvancedConfig (finetune_last_n_layers:int, task_model_kwargs:Dict[str,Any], datamodule_kwargs:Dict[str,Any], task_train_kwargs:Dict[str,Any], model_config_kwargs:Dict[str,Any], early_stopping_kwargs:Dict[str,Any], tokenizer_kwargs:Dict[str,Any], batch_encoding_kwargs:Dict[str,Any], dataloader_kwargs:Dict[str,Any])
In summary, the BasicConfig takes flat arguments and automatically assigns them to the proper keyword groups, while the AdvancedConfig takes as input keyword groups that are already correctly mapped. The list of available configs can be found below.
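To make the difference concrete, here is a minimal sketch (not part of the tutorial pipeline) of roughly the same settings expressed both ways; the grouping of the flat arguments into keyword groups follows the examples shown later in this tutorial and the values are arbitrary.

    from embeddings.config.lightning_config import LightningAdvancedConfig, LightningBasicConfig

    # Illustrative only: flat arguments, routed into the proper keyword groups by LightningBasicConfig.
    basic = LightningBasicConfig(learning_rate=1e-3, max_epochs=1, classifier_dropout=0.2)

    # Explicit keyword groups: LightningAdvancedConfig expects them to be mapped already.
    advanced = LightningAdvancedConfig(
        finetune_last_n_layers=-1,
        task_model_kwargs={"learning_rate": 1e-3},        # model/optimization settings
        task_train_kwargs={"max_epochs": 1},              # Lightning Trainer settings
        model_config_kwargs={"classifier_dropout": 0.2},  # transformer config overrides
        datamodule_kwargs={},
        early_stopping_kwargs={},
        tokenizer_kwargs={},
        batch_encoding_kwargs={},
        dataloader_kwargs={},
    )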
Running a pipeline with BasicConfig
Let’s run an example pipeline on the polemo2 dataset. But first we downsample the dataset due to hardware limitations; for that purpose we use HuggingFacePreprocessingPipeline.
HuggingFacePreprocessingPipeline
HuggingFacePreprocessingPipeline (dataset_name:str, persist_path:str, sample_missing_splits:Optional[Tuple[Optional[float],Optional[float]]]=None, downsample_splits:Optional[Tuple[Optional[float],Optional[float],Optional[float]]]=None, ignore_test_subset:bool=False, seed:int=441, load_dataset_kwargs:Optional[Dict[str,Any]]=None)
Preprocessing pipeline dedicated to work with HuggingFace datasets.
Then we need to call its run method.
PreprocessingPipeline.run
PreprocessingPipeline.run ()
    preprocessing = HuggingFacePreprocessingPipeline(
        dataset_name="clarin-pl/polemo2-official",
        persist_path="data/polemo2_downsampled",
        downsample_splits=(0.001, 0.005, 0.005),
    )
    preprocessing.run()
Downloading and preparing dataset polemo2-official/all_text (download: 6.37 MiB, generated: 6.30 MiB, post-processed: Unknown size, total: 12.68 MiB) to /home/runner/.cache/huggingface/datasets/clarin-pl___polemo2-official/all_text/0.0.0/2b75fdbe5def97538e81fb120f8752744b50729a4ce09bd75132bfc863a2fd70...
Dataset polemo2-official downloaded and prepared to /home/runner/.cache/huggingface/datasets/clarin-pl___polemo2-official/all_text/0.0.0/2b75fdbe5def97538e81fb120f8752744b50729a4ce09bd75132bfc863a2fd70. Subsequent calls will reuse this data.
Downloading builder script: 0%| | 0.00/5.90k [00:00<?, ?B/s]Downloading builder script: 100%|##########| 5.90k/5.90k [00:00<00:00, 5.38MB/s]
Downloading metadata: 0%| | 0.00/23.4k [00:00<?, ?B/s]Downloading metadata: 100%|##########| 23.4k/23.4k [00:00<00:00, 20.0MB/s]
Downloading readme: 0%| | 0.00/5.35k [00:00<?, ?B/s]Downloading readme: 100%|##########| 5.35k/5.35k [00:00<00:00, 5.18MB/s]
No config specified, defaulting to: polemo2-official/all_text
Downloading data files: 0%| | 0/1 [00:00<?, ?it/s]
Downloading data: 0%| | 0.00/5.37M [00:00<?, ?B/s]Downloading data: 100%|##########| 5.37M/5.37M [00:00<00:00, 57.6MB/s]
Downloading data files: 100%|##########| 1/1 [00:00<00:00, 1.57it/s]Downloading data files: 100%|##########| 1/1 [00:00<00:00, 1.57it/s]
Extracting data files: 0%| | 0/1 [00:00<?, ?it/s]Extracting data files: 100%|##########| 1/1 [00:00<00:00, 1903.04it/s]
Downloading data files: 0%| | 0/1 [00:00<?, ?it/s]
Downloading data: 0%| | 0.00/663k [00:00<?, ?B/s]Downloading data: 100%|##########| 663k/663k [00:00<00:00, 26.6MB/s]
Downloading data files: 100%|##########| 1/1 [00:00<00:00, 3.19it/s]Downloading data files: 100%|##########| 1/1 [00:00<00:00, 3.19it/s]
Extracting data files: 0%| | 0/1 [00:00<?, ?it/s]Extracting data files: 100%|##########| 1/1 [00:00<00:00, 1981.25it/s]
Downloading data files: 0%| | 0/1 [00:00<?, ?it/s]
Downloading data: 0%| | 0.00/649k [00:00<?, ?B/s]Downloading data: 100%|##########| 649k/649k [00:00<00:00, 20.6MB/s]
Downloading data files: 100%|##########| 1/1 [00:00<00:00, 3.35it/s]Downloading data files: 100%|##########| 1/1 [00:00<00:00, 3.35it/s]
Extracting data files: 0%| | 0/1 [00:00<?, ?it/s]Extracting data files: 100%|##########| 1/1 [00:00<00:00, 2004.93it/s]
Generating train split: 0%| | 0/6573 [00:00<?, ? examples/s]Generating train split: 40%|#### | 2632/6573 [00:00<00:00, 26267.54 examples/s]Generating train split: 82%|########2 | 5393/6573 [00:00<00:00, 26483.84 examples/s] Generating validation split: 0%| | 0/823 [00:00<?, ? examples/s] Generating test split: 0%| | 0/820 [00:00<?, ? examples/s] 0%| | 0/3 [00:00<?, ?it/s]100%|##########| 3/3 [00:00<00:00, 1082.21it/s]
Saving the dataset (0/1 shards): 0%| | 0/7 [00:00<?, ? examples/s]Saving the dataset (1/1 shards): 100%|##########| 7/7 [00:00<00:00, 2845.53 examples/s] Saving the dataset (0/1 shards): 0%| | 0/5 [00:00<?, ? examples/s]Saving the dataset (1/1 shards): 100%|##########| 5/5 [00:00<00:00, 2273.09 examples/s] Saving the dataset (0/1 shards): 0%| | 0/5 [00:00<?, ? examples/s]Saving the dataset (1/1 shards): 100%|##########| 5/5 [00:00<00:00, 2215.46 examples/s]
DatasetDict({
train: Dataset({
features: ['text', 'target'],
num_rows: 7
})
validation: Dataset({
features: ['text', 'target'],
num_rows: 5
})
test: Dataset({
features: ['text', 'target'],
num_rows: 5
})
})
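Since the splits were persisted to disk, they can also be reloaded later without re-running the preprocessing. A small sketch, assuming the pipeline saves them in the datasets save_to_disk format (which the “Saving the dataset” logs above suggest):

    from datasets import load_from_disk

    # Reload the downsampled splits persisted under persist_path (assumes save_to_disk format).
    downsampled = load_from_disk("data/polemo2_downsampled")
    print(downsampled)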
Our data is now prepared locally, so we can define our pipeline. Let’s start with the config. We will use parameters from clarin-pl/lepiszcze-allegro__herbert-base-cased-polemo2, whose configuration was obtained via an extensive hyperparameter search.
Due to hardware limitations we set the max_epochs parameter to 1 and leave the early stopping configuration parameters at their defaults.
LightningBasicConfig
LightningBasicConfig (use_scheduler:bool=True, optimizer:str='Adam', warmup_steps:int=100, learning_rate:float=0.0001, adam_epsilon:float=1e-08, weight_decay:float=0.0, finetune_last_n_layers:int=-1, classifier_dropout:Optional[float]=None, max_seq_length:Optional[int]=None, batch_size:int=32, max_epochs:Optional[int]=None, early_stopping_monitor:str='val/Loss', early_stopping_mode:str='min', early_stopping_patience:int=3)
    from embeddings.config.lightning_config import LightningBasicConfig

    config = LightningBasicConfig(
        use_scheduler=True,
        optimizer="Adam",
        warmup_steps=100,
        learning_rate=0.001,
        adam_epsilon=1e-06,
        weight_decay=0.001,
        finetune_last_n_layers=3,
        classifier_dropout=0.2,
        max_seq_length=None,
        batch_size=64,
        max_epochs=1,
    )
    config
LightningBasicConfig(use_scheduler=True, optimizer='Adam', warmup_steps=100, learning_rate=0.001, adam_epsilon=1e-06, weight_decay=0.001, finetune_last_n_layers=3, classifier_dropout=0.2, max_seq_length=None, batch_size=64, max_epochs=1, early_stopping_monitor='val/Loss', early_stopping_mode='min', early_stopping_patience=3, tokenizer_kwargs={}, batch_encoding_kwargs={}, dataloader_kwargs={})
Now we define the pipeline dedicated to text classification: LightningClassificationPipeline.
    from embeddings.pipeline.lightning_classification import LightningClassificationPipeline
LightningClassificationPipeline
LightningClassificationPipeline (embedding_name_or_path:Union[str,pathlib.Path], dataset_name_or_path:Union[str,pathlib.Path], input_column_name:Union[str,Sequence[str]], target_column_name:str, output_path:Union[str,pathlib.Path], evaluation_filename:str='evaluation.json', config:Union[embeddings.config.lightning_config.LightningBasicConfig, embeddings.config.lightning_config.LightningAdvancedConfig]=LightningBasicConfig(use_scheduler=True, optimizer='Adam', warmup_steps=100, learning_rate=0.0001, adam_epsilon=1e-08, weight_decay=0.0, finetune_last_n_layers=-1, classifier_dropout=None, max_seq_length=None, batch_size=32, max_epochs=None, early_stopping_monitor='val/Loss', early_stopping_mode='min', early_stopping_patience=3, tokenizer_kwargs={}, batch_encoding_kwargs={}, dataloader_kwargs={}), devices:Union[List[int],str,int,NoneType]='auto', accelerator:Union[str,pytorch_lightning.accelerators.accelerator.Accelerator,NoneType]='auto', logging_config:embeddings.utils.loggers.LightningLoggingConfig=LightningLoggingConfig(output_path='.', loggers_names=[], tracking_project_name=None, wandb_entity=None, wandb_logger_kwargs={}, loggers=None), tokenizer_name_or_path:Union[pathlib.Path,str,NoneType]=None, predict_subset:embeddings.data.dataset.LightingDataModuleSubset=<LightingDataModuleSubset.TEST: 'test'>, load_dataset_kwargs:Optional[Dict[str,Any]]=None, model_checkpoint_kwargs:Optional[Dict[str,Any]]=None, compile_model_kwargs:Optional[Dict[str,Any]]=None)
Helper class that provides a standard way to create an ABC using inheritance.
    from dataclasses import asdict  # For metrics conversion
    import pandas as pd  # For metrics conversion
    pipeline = LightningClassificationPipeline(
        embedding_name_or_path="hf-internal-testing/tiny-albert",
        dataset_name_or_path="data/polemo2_downsampled/",
        input_column_name="text",
        target_column_name="target",
        output_path=".",
        devices="auto",
        accelerator="cpu",
        config=config,
    )
Downloading tokenizer_config.json: 0%| | 0.00/422 [00:00<?, ?B/s]Downloading tokenizer_config.json: 100%|##########| 422/422 [00:00<00:00, 182kB/s]
Downloading spiece.model: 0%| | 0.00/321k [00:00<?, ?B/s]Downloading spiece.model: 100%|##########| 321k/321k [00:00<00:00, 13.6MB/s]
Downloading tokenizer.json: 0%| | 0.00/478k [00:00<?, ?B/s]Downloading tokenizer.json: 100%|##########| 478k/478k [00:00<00:00, 17.9MB/s]
Downloading (…)cial_tokens_map.json: 0%| | 0.00/244 [00:00<?, ?B/s]Downloading (…)cial_tokens_map.json: 100%|##########| 244/244 [00:00<00:00, 168kB/s]
Map: 0%| | 0/7 [00:00<?, ? examples/s] Map: 0%| | 0/5 [00:00<?, ? examples/s] Map: 0%| | 0/5 [00:00<?, ? examples/s] Casting the dataset: 0%| | 0/7 [00:00<?, ? examples/s] Casting the dataset: 0%| | 0/5 [00:00<?, ? examples/s] Casting the dataset: 0%| | 0/5 [00:00<?, ? examples/s]
As with HuggingFacePreprocessingPipeline, we call the run method.
LightningPipeline.run
LightningPipeline.run (run_name:Optional[str]=None)
    metrics = pipeline.run()
Sanity Checking: 0it [00:00, ?it/s]Sanity Checking: 0%| | 0/1 [00:00<?, ?it/s]Sanity Checking DataLoader 0: 0%| | 0/1 [00:00<?, ?it/s]Sanity Checking DataLoader 0: 100%|##########| 1/1 [00:00<00:00, 73.55it/s] Training: 0it [00:00, ?it/s]Training: 0%| | 0/1 [00:00<?, ?it/s]Epoch 0: 0%| | 0/1 [00:00<?, ?it/s] Epoch 0: 100%|##########| 1/1 [00:00<00:00, 32.32it/s]Epoch 0: 100%|##########| 1/1 [00:00<00:00, 32.05it/s, train/BaseLR=0.000, train/LambdaLR=0.000]
Validation: 0it [00:00, ?it/s]
Validation: 0%| | 0/1 [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/1 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|##########| 1/1 [00:00<00:00, 183.45it/s]Epoch 0: 100%|##########| 1/1 [00:00<00:00, 23.24it/s, train/BaseLR=0.000, train/LambdaLR=0.000, val/MulticlassAccuracy=0.400, val/MulticlassPrecision=0.100, val/MulticlassRecall=0.250, val/MulticlassF1Score=0.143]
Epoch 0: 100%|##########| 1/1 [00:00<00:00, 22.83it/s, train/BaseLR=0.000, train/LambdaLR=0.000, val/MulticlassAccuracy=0.400, val/MulticlassPrecision=0.100, val/MulticlassRecall=0.250, val/MulticlassF1Score=0.143]Epoch 0: 100%|##########| 1/1 [00:00<00:00, 17.14it/s, train/BaseLR=0.000, train/LambdaLR=0.000, val/MulticlassAccuracy=0.400, val/MulticlassPrecision=0.100, val/MulticlassRecall=0.250, val/MulticlassF1Score=0.143]
Predicting: 0it [00:00, ?it/s]Predicting: 0%| | 0/1 [00:00<?, ?it/s]Predicting DataLoader 0: 0%| | 0/1 [00:00<?, ?it/s]Predicting DataLoader 0: 100%|##########| 1/1 [00:00<00:00, 199.41it/s]Predicting DataLoader 0: 100%|##########| 1/1 [00:00<00:00, 180.49it/s]
Downloading config.json: 0%| | 0.00/787 [00:00<?, ?B/s]Downloading config.json: 100%|##########| 787/787 [00:00<00:00, 380kB/s]
Downloading pytorch_model.bin: 0%| | 0.00/730k [00:00<?, ?B/s]Downloading pytorch_model.bin: 100%|##########| 730k/730k [00:00<00:00, 20.3MB/s]
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, val_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 4 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 4 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py:189: UserWarning: .predict(ckpt_path="last") is set, but there is no last checkpoint available. No checkpoint will be loaded.
rank_zero_warn(
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, predict_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 4 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
Downloading builder script: 0%| | 0.00/4.20k [00:00<?, ?B/s]Downloading builder script: 100%|##########| 4.20k/4.20k [00:00<00:00, 4.39MB/s]
Downloading builder script: 0%| | 0.00/6.77k [00:00<?, ?B/s]Downloading builder script: 100%|##########| 6.77k/6.77k [00:00<00:00, 6.69MB/s]
Downloading builder script: 0%| | 0.00/7.36k [00:00<?, ?B/s]Downloading builder script: 100%|##########| 7.36k/7.36k [00:00<00:00, 7.39MB/s]
Downloading builder script: 0%| | 0.00/7.55k [00:00<?, ?B/s]Downloading builder script: 100%|##########| 7.55k/7.55k [00:00<00:00, 7.49MB/s]
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
    metrics = pd.DataFrame.from_dict(asdict(metrics), orient="index", columns=["values"])
    metrics
| | values |
|---|---|
| accuracy | 0.0 |
| f1_macro | 0.0 |
| f1_micro | 0.0 |
| f1_weighted | 0.0 |
| recall_macro | 0.0 |
| recall_micro | 0.0 |
| recall_weighted | 0.0 |
| precision_macro | 0.0 |
| precision_micro | 0.0 |
| precision_weighted | 0.0 |
| classes | {0: {'precision': 0.0, 'recall': 0.0, 'f1': 0.... |
| data | {'y_pred': [0, 0, 0, 0, 0], 'y_true': [1, 1, 1... |
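Besides the aggregated scores, the classes row holds per-class metrics. A minimal sketch to view them as a table, assuming the nested dict layout shown above ({label: {'precision': ..., 'recall': ..., 'f1': ...}}):

    # Expand the per-class metrics stored in the "classes" row of the DataFrame built above.
    per_class = pd.DataFrame.from_dict(metrics.loc["classes", "values"], orient="index")
    per_class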
Running a pipeline with AdvancedConfig
As mentioned in the previous section, LightningBasicConfig is limited to the most important parameters. Let’s see an example of defining the parameters in LightningAdvancedConfig. Tracing back the different kwargs, we can find:
- task_train_kwargs: parameters that are passed to the Lightning Trainer object.
- task_model_kwargs: parameters that are passed to the Lightning module object (we use TextClassificationModule, which inherits from HuggingFaceLightningModule).
- datamodule_kwargs: parameters passed to the datamodule classes; currently HuggingFaceDataModule takes several arguments (such as max_seq_length, processing_batch_size or downsampling args) as input.
- batch_encoding_kwargs: parameters defined in the __call__ method of the tokenizer, which allow manipulating the tokenized text by setting parameters such as truncation, padding, stride, etc. and specifying the return format of the tokenized text.
- tokenizer_kwargs: a generic configuration of the Hugging Face model’s tokenizer; the possible parameters depend on the tokenizer that is used. For example, for the bert uncased tokenizer these parameters are listed here: https://huggingface.co/bert-base-uncased/blob/main/tokenizer_config.json
- load_dataset_kwargs: keyword arguments for the datasets.load_dataset method, which loads a dataset from the Hugging Face Hub or a local dataset; mostly metadata for downloading, loading and caching the dataset.
- model_config_kwargs: a generic configuration of the Hugging Face model; the possible parameters depend on the model that is used. For example, for bert uncased these parameters are listed here: https://huggingface.co/bert-base-uncased/blob/main/config.json
- early_stopping_kwargs: params defined in __init__ of the EarlyStopping lightning callback; you can specify a metric to monitor and conditions to stop training when it stops improving.
- dataloader_kwargs: defined in __init__ of the torch DataLoader object, which wraps an iterable around the Dataset to enable easy access to the samples; specify params such as the number of workers, sampling or shuffling.
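For illustration, here is how a few of these keyword groups could be filled in; the specific values below (padding strategy, worker count, patience) are hypothetical and are not used in the config we build next.

    # Hypothetical values, shown only to indicate where each kind of setting belongs.
    # These dicts could be passed to LightningAdvancedConfig alongside the other keyword groups.
    illustrative_groups = {
        "batch_encoding_kwargs": {"padding": "max_length", "truncation": True},          # tokenizer __call__ args
        "dataloader_kwargs": {"num_workers": 4},                                         # torch DataLoader args
        "early_stopping_kwargs": {"monitor": "val/Loss", "mode": "min", "patience": 5},  # EarlyStopping args
    }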
Let’s now create an advanced config with all the parameters we want to use.
    from embeddings.config.lightning_config import LightningAdvancedConfig

    advanced_config = LightningAdvancedConfig(
        finetune_last_n_layers=0,
        datamodule_kwargs={
            "max_seq_length": None,
        },
        task_train_kwargs={
            "max_epochs": 1,
            "devices": "auto",
            "accelerator": "cpu",
            "deterministic": True,
        },
        task_model_kwargs={
            "learning_rate": 0.001,
            "train_batch_size": 64,
            "eval_batch_size": 64,
            "use_scheduler": True,
            "optimizer": "Adam",
            "adam_epsilon": 1e-6,
            "warmup_steps": 100,
            "weight_decay": 0.001,
        },
        early_stopping_kwargs=None,
        model_config_kwargs={"classifier_dropout": 0.2},
        tokenizer_kwargs={},
        batch_encoding_kwargs={},
        dataloader_kwargs={},
    )
    advanced_config
LightningAdvancedConfig(finetune_last_n_layers=0, task_model_kwargs={'learning_rate': 0.001, 'train_batch_size': 64, 'eval_batch_size': 64, 'use_scheduler': True, 'optimizer': 'Adam', 'adam_epsilon': 1e-06, 'warmup_steps': 100, 'weight_decay': 0.001}, datamodule_kwargs={'max_seq_length': None}, task_train_kwargs={'max_epochs': 1, 'devices': 'auto', 'accelerator': 'cpu', 'deterministic': True}, model_config_kwargs={'classifier_dropout': 0.2}, early_stopping_kwargs=None, tokenizer_kwargs={}, batch_encoding_kwargs={}, dataloader_kwargs={})
Now we can pass the config to the pipeline and run it.
    pipeline = LightningClassificationPipeline(
        embedding_name_or_path="hf-internal-testing/tiny-albert",
        dataset_name_or_path="data/polemo2_downsampled/",
        input_column_name="text",
        target_column_name="target",
        output_path=".",
        devices="auto",
        accelerator="cpu",
        config=advanced_config,
    )

    metrics_adv_cfg = pipeline.run()
Sanity Checking: 0it [00:00, ?it/s]Sanity Checking: 0%| | 0/1 [00:00<?, ?it/s]Sanity Checking DataLoader 0: 0%| | 0/1 [00:00<?, ?it/s]Sanity Checking DataLoader 0: 100%|##########| 1/1 [00:00<00:00, 134.73it/s] Training: 0it [00:00, ?it/s]Training: 0%| | 0/1 [00:00<?, ?it/s]Epoch 0: 0%| | 0/1 [00:00<?, ?it/s] Epoch 0: 100%|##########| 1/1 [00:00<00:00, 66.37it/s]Epoch 0: 100%|##########| 1/1 [00:00<00:00, 65.52it/s, train/BaseLR=0.000, train/LambdaLR=0.000]
Validation: 0it [00:00, ?it/s]
Validation: 0%| | 0/1 [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/1 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|##########| 1/1 [00:00<00:00, 181.86it/s]Epoch 0: 100%|##########| 1/1 [00:00<00:00, 37.33it/s, train/BaseLR=0.000, train/LambdaLR=0.000, val/MulticlassAccuracy=0.400, val/MulticlassPrecision=0.100, val/MulticlassRecall=0.250, val/MulticlassF1Score=0.143]
Epoch 0: 100%|##########| 1/1 [00:00<00:00, 36.28it/s, train/BaseLR=0.000, train/LambdaLR=0.000, val/MulticlassAccuracy=0.400, val/MulticlassPrecision=0.100, val/MulticlassRecall=0.250, val/MulticlassF1Score=0.143]Epoch 0: 100%|##########| 1/1 [00:00<00:00, 27.31it/s, train/BaseLR=0.000, train/LambdaLR=0.000, val/MulticlassAccuracy=0.400, val/MulticlassPrecision=0.100, val/MulticlassRecall=0.250, val/MulticlassF1Score=0.143]
Predicting: 0it [00:00, ?it/s]Predicting: 0%| | 0/1 [00:00<?, ?it/s]Predicting DataLoader 0: 0%| | 0/1 [00:00<?, ?it/s]Predicting DataLoader 0: 100%|##########| 1/1 [00:00<00:00, 198.31it/s]Predicting DataLoader 0: 100%|##########| 1/1 [00:00<00:00, 187.12it/s]
Loading cached processed dataset at /home/runner/work/embeddings/embeddings/data/polemo2_downsampled/train/cache-f234959a87a0d2f7.arrow
Map: 0%| | 0/5 [00:00<?, ? examples/s] Loading cached processed dataset at /home/runner/work/embeddings/embeddings/data/polemo2_downsampled/test/cache-fcf83583c33e778e.arrow
Loading cached processed dataset at /home/runner/work/embeddings/embeddings/data/polemo2_downsampled/train/cache-21d141ad052dbfb6.arrow
Casting the dataset: 0%| | 0/5 [00:00<?, ? examples/s] Loading cached processed dataset at /home/runner/work/embeddings/embeddings/data/polemo2_downsampled/test/cache-ea613b61704ed6a7.arrow
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:612: UserWarning: Checkpoint directory /home/runner/work/embeddings/embeddings/checkpoints exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, val_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 4 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 4 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py:189: UserWarning: .predict(ckpt_path="last") is set, but there is no last checkpoint available. No checkpoint will be loaded.
rank_zero_warn(
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, predict_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 4 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
Finally, we can check out some of the metrics.
    metrics_adv_cfg = pd.DataFrame.from_dict(asdict(metrics_adv_cfg), orient="index", columns=["values"])
    metrics_adv_cfg
| | values |
|---|---|
| accuracy | 0.6 |
| f1_macro | 0.25 |
| f1_micro | 0.6 |
| f1_weighted | 0.45 |
| recall_macro | 0.333333 |
| recall_micro | 0.6 |
| recall_weighted | 0.6 |
| precision_macro | 0.2 |
| precision_micro | 0.6 |
| precision_weighted | 0.36 |
| classes | {0: {'precision': 0.6, 'recall': 1.0, 'f1': 0.... |
| data | {'y_pred': [1, 1, 1, 1, 1], 'y_true': [1, 1, 1... |
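If you kept both metric DataFrames around, the two runs can be compared side by side; a small sketch using the metrics and metrics_adv_cfg frames built above:

    # Join the scalar metrics from both runs for a quick comparison.
    comparison = pd.concat(
        [metrics["values"], metrics_adv_cfg["values"]],
        axis=1,
        keys=["basic_config", "advanced_config"],
    )
    comparison.drop(index=["classes", "data"])  # keep only the scalar metrics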
We used a very small dataset and a very small language model, so the results are not very good. However, in practice you will surely get better results with more sophisticated models and larger datasets.
Good luck with your experiments!