from embeddings.pipeline.hf_preprocessing_pipeline import HuggingFacePreprocessingPipeline

Two types of config are defined in our library: BasicConfig and AdvancedConfig.
BasicConfig
Allows for easy use of the most common parameters in the pipeline.
LightningBasicConfig
LightningBasicConfig (use_scheduler:bool=True, optimizer:str='Adam', warmup_steps:int=100, learning_rate:float=0.0001, adam_epsilon:float=1e-08, weight_decay:float=0.0, finetune_last_n_layers:int=-1, classifier_dropout:Optional[float]=None, max_seq_length:Optional[int]=None, batch_size:int=32, max_epochs:Optional[int]=None, early_stopping_monitor:str='val/Loss', early_stopping_mode:str='min', early_stopping_patience:int=3)
AdvancedConfig
The objects defined in our pipelines are constructed in a way that they can be further parametrized with keyword arguments. These arguments can be provided by constructing the AdvancedConfig.
LightningAdvancedConfig
LightningAdvancedConfig (finetune_last_n_layers:int, task_model_kwargs:Dict[str,Any], datamodule_kwargs:Dict[str,Any], task_train_kwargs:Dict[str,Any], model_config_kwargs:Dict[str,Any], early_stopping_kwargs:Dict[str,Any], tokenizer_kwargs:Dict[str,Any], batch_encoding_kwargs:Dict[str,Any], dataloader_kwargs:Dict[str,Any])
In summary, the BasicConfig takes flat arguments and automatically assigns them to the proper keyword groups, while the AdvancedConfig takes as input keyword groups that are already correctly mapped.
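For instance, the same settings can be written either way; a minimal sketch (the grouping used in the advanced variant mirrors the advanced example later in this tutorial and is illustrative rather than exhaustive):

from embeddings.config.lightning_config import (
    LightningAdvancedConfig,
    LightningBasicConfig,
)

# BasicConfig: flat arguments, mapped to keyword groups for us.
basic = LightningBasicConfig(learning_rate=1e-3, batch_size=64, max_epochs=1)

# AdvancedConfig: we pass the keyword groups ourselves
# (all groups are required; unused ones stay empty).
advanced = LightningAdvancedConfig(
    finetune_last_n_layers=-1,
    task_model_kwargs={"learning_rate": 1e-3, "train_batch_size": 64, "eval_batch_size": 64},
    task_train_kwargs={"max_epochs": 1},
    datamodule_kwargs={},
    model_config_kwargs={},
    early_stopping_kwargs={},
    tokenizer_kwargs={},
    batch_encoding_kwargs={},
    dataloader_kwargs={},
)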
The list of available configs can be found below.
Running pipeline with BasicConfig
Let’s run an example pipeline on the polemo2 dataset.
But first, due to hardware limitations, we downsample our dataset. For that purpose we use HuggingFacePreprocessingPipeline.
HuggingFacePreprocessingPipeline
HuggingFacePreprocessingPipeline (dataset_name:str, persist_path:str, sample_missing_splits:Optional[Tuple[Optional[float],Optional[float]]]=None, downsample_splits:Optional[Tuple[Optional[float],Optional[float],Optional[float]]]=None, ignore_test_subset:bool=False, seed:int=441, load_dataset_kwargs:Optional[Dict[str,Any]]=None)
Preprocessing pipeline dedicated to work with HuggingFace datasets.
Then we need to use the run method.
PreprocessingPipeline.run
PreprocessingPipeline.run ()
preprocessing = HuggingFacePreprocessingPipeline(
dataset_name="clarin-pl/polemo2-official",
persist_path="data/polemo2_downsampled",
downsample_splits=(0.001, 0.005, 0.005)
)
preprocessing.run()

Downloading and preparing dataset polemo2-official/all_text (download: 6.37 MiB, generated: 6.30 MiB, post-processed: Unknown size, total: 12.68 MiB) to /home/runner/.cache/huggingface/datasets/clarin-pl___polemo2-official/all_text/0.0.0/2b75fdbe5def97538e81fb120f8752744b50729a4ce09bd75132bfc863a2fd70...
Dataset polemo2-official downloaded and prepared to /home/runner/.cache/huggingface/datasets/clarin-pl___polemo2-official/all_text/0.0.0/2b75fdbe5def97538e81fb120f8752744b50729a4ce09bd75132bfc863a2fd70. Subsequent calls will reuse this data.
No config specified, defaulting to: polemo2-official/all_text
DatasetDict({
train: Dataset({
features: ['text', 'target'],
num_rows: 7
})
validation: Dataset({
features: ['text', 'target'],
num_rows: 5
})
test: Dataset({
features: ['text', 'target'],
num_rows: 5
})
})
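The preprocessing pipeline persisted the downsampled splits under the persist_path we passed, so you can also inspect the data yourself; a minimal sketch, assuming the splits are stored in the standard Hugging Face datasets on-disk format:

from datasets import load_from_disk

# Load the persisted, downsampled splits for inspection
# (assumes they were written with the standard `datasets` save-to-disk format).
dataset = load_from_disk("data/polemo2_downsampled")
print(dataset)              # DatasetDict with train/validation/test splits
print(dataset["train"][0])  # a single example with 'text' and 'target' fields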
Now that we have our data prepared locally, we need to define our pipeline.
Let’s start with the config. We will use parameters from clarin-pl/lepiszcze-allegro__herbert-base-cased-polemo2, whose configuration was obtained from an extensive hyperparameter search.
Due to hardware limitations we set the max_epochs parameter to 1 and leave the early stopping configuration parameters at their defaults.
LightningBasicConfig
LightningBasicConfig (use_scheduler:bool=True, optimizer:str='Adam', warmup_steps:int=100, learning_rate:float=0.0001, adam_epsilon:float=1e-08, weight_decay:float=0.0, finetune_last_n_layers:int=-1, classifier_dropout:Optional[float]=None, max_seq_length:Optional[int]=None, batch_size:int=32, max_epochs:Optional[int]=None, early_stopping_monitor:str='val/Loss', early_stopping_mode:str='min', early_stopping_patience:int=3)
config = LightningBasicConfig(
use_scheduler=True,
optimizer="Adam",
warmup_steps=100,
learning_rate=0.001,
adam_epsilon=1e-06,
weight_decay=0.001,
finetune_last_n_layers=3,
classifier_dropout=0.2,
max_seq_length=None,
batch_size=64,
max_epochs=1,
)
config

LightningBasicConfig(use_scheduler=True, optimizer='Adam', warmup_steps=100, learning_rate=0.001, adam_epsilon=1e-06, weight_decay=0.001, finetune_last_n_layers=3, classifier_dropout=0.2, max_seq_length=None, batch_size=64, max_epochs=1, early_stopping_monitor='val/Loss', early_stopping_mode='min', early_stopping_patience=3, tokenizer_kwargs={}, batch_encoding_kwargs={}, dataloader_kwargs={})
Now we define a pipeline dedicated to text classification: LightningClassificationPipeline.
from embeddings.pipeline.lightning_classification import LightningClassificationPipeline

LightningClassificationPipeline
LightningClassificationPipeline (embedding_name_or_path:Union[str,pathlib.Path], dataset_name_or_path:Union[str,pathlib.Path], input_column_name:Union[str,Sequence[str]], target_column_name:str, output_path:Union[str,pathlib.Path], evaluation_filename:str='evaluation.json', config:Union[embeddings.config.lightning_config.LightningBasicConfig,embeddings.config.lightning_config.LightningAdvancedConfig]=LightningBasicConfig(use_scheduler=True, optimizer='Adam', warmup_steps=100, learning_rate=0.0001, adam_epsilon=1e-08, weight_decay=0.0, finetune_last_n_layers=-1, classifier_dropout=None, max_seq_length=None, batch_size=32, max_epochs=None, early_stopping_monitor='val/Loss', early_stopping_mode='min', early_stopping_patience=3, tokenizer_kwargs={}, batch_encoding_kwargs={}, dataloader_kwargs={}), devices:Union[List[int],str,int,NoneType]='auto', accelerator:Union[str,pytorch_lightning.accelerators.accelerator.Accelerator,NoneType]='auto', logging_config:embeddings.utils.loggers.LightningLoggingConfig=LightningLoggingConfig(output_path='.', loggers_names=[], tracking_project_name=None, wandb_entity=None, wandb_logger_kwargs={}, loggers=None), tokenizer_name_or_path:Union[pathlib.Path,str,NoneType]=None, predict_subset:embeddings.data.dataset.LightingDataModuleSubset=<LightingDataModuleSubset.TEST: 'test'>, load_dataset_kwargs:Optional[Dict[str,Any]]=None, model_checkpoint_kwargs:Optional[Dict[str,Any]]=None, compile_model_kwargs:Optional[Dict[str,Any]]=None)
Helper class that provides a standard way to create an ABC using inheritance.
from dataclasses import asdict # For metrics conversion
import pandas as pd  # For metrics conversion

pipeline = LightningClassificationPipeline(
embedding_name_or_path="hf-internal-testing/tiny-albert",
dataset_name_or_path="data/polemo2_downsampled/",
input_column_name="text",
target_column_name="target",
output_path=".",
devices="auto",
accelerator="cpu",
config=config
)
Similarly to HuggingFacePreprocessingPipeline, we use the run method.
LightningPipeline.run
LightningPipeline.run (run_name:Optional[str]=None)
metrics = pipeline.run()

Epoch 0: 100%|##########| 1/1 [00:00<00:00, 17.14it/s, train/BaseLR=0.000, train/LambdaLR=0.000, val/MulticlassAccuracy=0.400, val/MulticlassPrecision=0.100, val/MulticlassRecall=0.250, val/MulticlassF1Score=0.143]
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, val_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 4 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 4 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py:189: UserWarning: .predict(ckpt_path="last") is set, but there is no last checkpoint available. No checkpoint will be loaded.
rank_zero_warn(
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, predict_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 4 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
metrics = pd.DataFrame.from_dict(asdict(metrics), orient="index", columns=["values"])
metrics

|  | values |
|---|---|
| accuracy | 0.0 |
| f1_macro | 0.0 |
| f1_micro | 0.0 |
| f1_weighted | 0.0 |
| recall_macro | 0.0 |
| recall_micro | 0.0 |
| recall_weighted | 0.0 |
| precision_macro | 0.0 |
| precision_micro | 0.0 |
| precision_weighted | 0.0 |
| classes | {0: {'precision': 0.0, 'recall': 0.0, 'f1': 0.... |
| data | {'y_pred': [0, 0, 0, 0, 0], 'y_true': [1, 1, 1... |
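Since the converted metrics form a regular pandas DataFrame indexed by metric name, individual values can be read off directly:

# Look up a single metric from the DataFrame built above.
f1_macro = metrics.loc["f1_macro", "values"]
print(f"macro F1: {f1_macro:.3f}")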
Running pipeline with AdvancedConfig
As mentioned in the previous section, LightningBasicConfig is limited to the most important parameters.
Let’s see an example of defining the parameters in our LightningAdvancedConfig. Tracing back the different kwargs, we can find:
- task_train_kwargs: parameters that are passed to the Lightning Trainer object.
- task_model_kwargs: parameters that are passed to the Lightning module object (we use TextClassificationModule, which inherits from HuggingFaceLightningModule).
- datamodule_kwargs: parameters passed to the datamodule classes; currently HuggingFaceDataModule takes several arguments (such as max_seq_length, processing_batch_size or downsampling args) as input.
- batch_encoding_kwargs: parameters that are defined in the __call__ method of the tokenizer, which allow for manipulating the tokenized text by setting parameters such as truncation, padding, stride, etc. and specifying the return format of the tokenized text.
- tokenizer_kwargs: a generic configuration of the Hugging Face model’s tokenizer; the possible parameters depend on the tokenizer that is used. For example, for the BERT uncased tokenizer these parameters are listed here: https://huggingface.co/bert-base-uncased/blob/main/tokenizer_config.json
- load_dataset_kwargs: keyword arguments of the datasets.load_dataset method, which loads a dataset from the Hugging Face Hub or a local dataset; mostly metadata for downloading, loading and caching the dataset.
- model_config_kwargs: a generic configuration of the Hugging Face model; the possible parameters depend on the model that is used. For example, for BERT uncased these parameters are listed here: https://huggingface.co/bert-base-uncased/blob/main/config.json
- early_stopping_kwargs: parameters defined in __init__ of the EarlyStopping Lightning callback; you can specify a metric to monitor and conditions to stop training when it stops improving.
- dataloader_kwargs: parameters defined in __init__ of the torch DataLoader object, which wraps an iterable around the dataset to enable easy access to the samples; you can specify parameters such as the number of workers, sampling or shuffling.
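As an illustration, the keyword groups that the example below leaves empty could be populated like this (the keys shown are illustrative; which ones are accepted depends on the tokenizer, the DataLoader and the EarlyStopping callback actually used):

# Hypothetical values for the keyword groups left empty in the example below.
extra_kwargs = {
    "early_stopping_kwargs": {"monitor": "val/Loss", "mode": "min", "patience": 3},
    "tokenizer_kwargs": {"do_lower_case": False},
    "batch_encoding_kwargs": {"truncation": True, "padding": "max_length", "max_length": 128},
    "dataloader_kwargs": {"num_workers": 4},  # would address the num_workers warnings seen earlier
}
# These dicts could then be passed to the config, e.g. LightningAdvancedConfig(..., **extra_kwargs).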
Let’s create an advanced config with all the parameters we want to use.
advanced_config = LightningAdvancedConfig(
finetune_last_n_layers=0,
datamodule_kwargs={
"max_seq_length": None,
},
task_train_kwargs={
"max_epochs": 1,
"devices": "auto",
"accelerator": "cpu",
"deterministic": True,
},
task_model_kwargs={
"learning_rate": 0.001,
"train_batch_size": 64,
"eval_batch_size": 64,
"use_scheduler": True,
"optimizer": "Adam",
"adam_epsilon": 1e-6,
"warmup_steps": 100,
"weight_decay": 0.001,
},
early_stopping_kwargs=None,
model_config_kwargs={"classifier_dropout": 0.2},
tokenizer_kwargs={},
batch_encoding_kwargs={},
dataloader_kwargs={}
)
advanced_config

LightningAdvancedConfig(finetune_last_n_layers=0, task_model_kwargs={'learning_rate': 0.001, 'train_batch_size': 64, 'eval_batch_size': 64, 'use_scheduler': True, 'optimizer': 'Adam', 'adam_epsilon': 1e-06, 'warmup_steps': 100, 'weight_decay': 0.001}, datamodule_kwargs={'max_seq_length': None}, task_train_kwargs={'max_epochs': 1, 'devices': 'auto', 'accelerator': 'cpu', 'deterministic': True}, model_config_kwargs={'classifier_dropout': 0.2}, early_stopping_kwargs=None, tokenizer_kwargs={}, batch_encoding_kwargs={}, dataloader_kwargs={})
Now we can pass the config to the pipeline and run it.
pipeline = LightningClassificationPipeline(
embedding_name_or_path="hf-internal-testing/tiny-albert",
dataset_name_or_path="data/polemo2_downsampled/",
input_column_name="text",
target_column_name="target",
output_path=".",
devices="auto",
accelerator="cpu",
config=advanced_config
)
metrics_adv_cfg = pipeline.run()

Epoch 0: 100%|##########| 1/1 [00:00<00:00, 27.31it/s, train/BaseLR=0.000, train/LambdaLR=0.000, val/MulticlassAccuracy=0.400, val/MulticlassPrecision=0.100, val/MulticlassRecall=0.250, val/MulticlassF1Score=0.143]
Loading cached processed dataset at /home/runner/work/embeddings/embeddings/data/polemo2_downsampled/train/cache-f234959a87a0d2f7.arrow
Loading cached processed dataset at /home/runner/work/embeddings/embeddings/data/polemo2_downsampled/test/cache-fcf83583c33e778e.arrow
Loading cached processed dataset at /home/runner/work/embeddings/embeddings/data/polemo2_downsampled/train/cache-21d141ad052dbfb6.arrow
Loading cached processed dataset at /home/runner/work/embeddings/embeddings/data/polemo2_downsampled/test/cache-ea613b61704ed6a7.arrow
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:612: UserWarning: Checkpoint directory /home/runner/work/embeddings/embeddings/checkpoints exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, val_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 4 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 4 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py:189: UserWarning: .predict(ckpt_path="last") is set, but there is no last checkpoint available. No checkpoint will be loaded.
rank_zero_warn(
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, predict_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 4 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/home/runner/work/embeddings/embeddings/.venv/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
Finally, we can check out some of the metrics.
metrics_adv_cfg = pd.DataFrame.from_dict(asdict(metrics_adv_cfg), orient="index", columns=["values"])
metrics_adv_cfg

|  | values |
|---|---|
| accuracy | 0.6 |
| f1_macro | 0.25 |
| f1_micro | 0.6 |
| f1_weighted | 0.45 |
| recall_macro | 0.333333 |
| recall_micro | 0.6 |
| recall_weighted | 0.6 |
| precision_macro | 0.2 |
| precision_micro | 0.6 |
| precision_weighted | 0.36 |
| classes | {0: {'precision': 0.6, 'recall': 1.0, 'f1': 0.... |
| data | {'y_pred': [1, 1, 1, 1, 1], 'y_true': [1, 1, 1... |
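The data entry keeps the raw predictions and gold labels, so we can derive additional views of the results, for example a confusion matrix; a short sketch, assuming scikit-learn is available:

from sklearn.metrics import confusion_matrix

# Raw predictions and gold labels from the `data` row of the metrics DataFrame.
data = metrics_adv_cfg.loc["data", "values"]
print(confusion_matrix(data["y_true"], data["y_pred"]))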
We used a very small dataset and a very small language model, so the results are not very good. In practice, we can expect much better results with more sophisticated models and larger datasets.
Good luck in your experiments!