This project predicts toxic comments on a modified version of the Jigsaw Toxic Comment dataset on Kaggle.

If anything is confusing, please see the accompanying Medium article for an explanation of my methodologies!

## Datasets

As mentioned previously, the datasets used to train our models are based on the Jigsaw Toxic Comment dataset found on Kaggle. This dataset has labels intended for a multi-label classification task (e.g. Toxic, Severe Toxic, Obscene, Threat, Insult, Identity Hate), but we decided against using these labels due to their subjectivity. Instead, we converted the original dataset to a binary classification task where comments are labeled either toxic (`isToxic` = 1) or non-toxic (`isToxic` = 0). Toxic comments make up 9.58% of this intermediate dataset.

To get our unbalanced dataset, we undersampled the majority class of this intermediate dataset until toxic comments made up 20.15% of all data. To get our balanced dataset, we used nlpaug to augment the minority class of the unbalanced dataset until we reached a 50-50 class distribution. Text augmentation was performed with synonym replacement using BERT embeddings. Any files or folders with "unbalanced" or "balanced" in their names refer to these two datasets.

## Results

Both models follow the same …

## Citation

```bibtex
@inproceedings{wolf-etal-2020-transformers,
    title = "Transformers: State-of-the-Art Natural Language Processing",
    author = "Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    year = "2020"
}
```
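The label conversion and undersampling steps described in the Datasets section could be sketched as follows. This is a minimal illustration, not the repo's actual code: the six label columns follow the public Kaggle Jigsaw schema, and the function names and sampling logic are assumptions.

```python
import pandas as pd

# Six Jigsaw labels from the Kaggle schema (assumed column names).
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def to_binary(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse the six multi-label columns into a single binary isToxic column."""
    out = df.copy()
    out["isToxic"] = (out[LABELS].sum(axis=1) > 0).astype(int)
    return out.drop(columns=LABELS)

def undersample_majority(df: pd.DataFrame,
                         toxic_frac: float = 0.2015,
                         seed: int = 0) -> pd.DataFrame:
    """Drop non-toxic rows until toxic rows make up roughly toxic_frac of the data."""
    toxic = df[df["isToxic"] == 1]
    non_toxic = df[df["isToxic"] == 0]
    # n_toxic / (n_toxic + n_keep) = toxic_frac  =>  n_keep = n_toxic * (1 - f) / f
    n_keep = int(len(toxic) * (1 - toxic_frac) / toxic_frac)
    kept = non_toxic.sample(n=min(n_keep, len(non_toxic)), random_state=seed)
    return pd.concat([toxic, kept]).sample(frac=1, random_state=seed)
```

The 20.15% target is taken from the README text; the random seed and shuffle at the end are illustrative choices.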
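The balancing step — augmenting the toxic minority class with nlpaug until the split is 50-50 — might look like the sketch below. The helper names are hypothetical; `ContextualWordEmbsAug` with `action="substitute"` is nlpaug's BERT-based word-substitution augmenter, which matches the README's "synonym replacement using BERT embeddings", but whether the repo used exactly these parameters is an assumption.

```python
def n_augments_needed(n_toxic: int, n_non_toxic: int) -> int:
    """Synthetic toxic comments required to reach a 50-50 class split."""
    return max(n_non_toxic - n_toxic, 0)

def augment_toxic(texts, n_needed):
    """Generate n_needed synthetic toxic comments by cycling over the
    existing toxic texts with nlpaug's BERT-based substitution.
    Imported lazily because it downloads a pretrained BERT model."""
    import nlpaug.augmenter.word as naw
    aug = naw.ContextualWordEmbsAug(model_path="bert-base-uncased",
                                    action="substitute")
    out, i = [], 0
    while len(out) < n_needed:
        res = aug.augment(texts[i % len(texts)])
        # aug.augment returns a list in newer nlpaug versions, a string in older ones.
        out.extend(res if isinstance(res, list) else [res])
        i += 1
    return out[:n_needed]
```

For example, an unbalanced set with roughly 20% toxic comments needs about four synthetic toxic comments per original toxic comment to reach 50-50.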