ParsBERT

A monolingual Persian language model based on Google's BERT architecture.

🤗 ParsBERT: Transformer-based Model for Persian Language Understanding

ParsBERT was trained on a large collection of public corpora (Persian Wikipedia dumps, MirasText) and six manually crawled text sources from various types of websites: BigBang Page (scientific), Chetor (lifestyle), Eligasht (itineraries), Digikala (digital magazine), Ted Talks (general conversation), and books (novels, storybooks, and short stories from the classical era to the contemporary one).

As part of the ParsBERT methodology, an extensive pre-processing step combining POS tagging and WordPiece segmentation was carried out to bring the corpora into a proper format.
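
Since the released tokenizer bakes in the result of this pre-processing, the WordPiece segmentation can be inspected directly. Below is a minimal sketch, assuming the Hugging Face `transformers` library and the `HooshvareLab/bert-base-parsbert-uncased` checkpoint name (adjust it to match the release you use):

```python
# Minimal sketch: inspect ParsBERT's WordPiece segmentation.
# Assumes the Hugging Face `transformers` library and the
# `HooshvareLab/bert-base-parsbert-uncased` checkpoint name.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")

# "Artificial intelligence is changing our world."
text = "هوش مصنوعی دنیای ما را تغییر می‌دهد."
tokens = tokenizer.tokenize(text)  # sub-word WordPiece pieces
print(tokens)
```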

Follow the rest of the repo for more details.
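
For a quick end-to-end check, the pre-trained model can be exercised through a masked-token prediction pipeline. A minimal sketch, again assuming the `transformers` library and the same checkpoint name:

```python
# Minimal sketch: masked-token prediction with ParsBERT.
# The checkpoint name is an assumption; swap in the release you use.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="HooshvareLab/bert-base-parsbert-uncased")

# "Tehran is the capital of [MASK]."
for prediction in fill_mask("تهران پایتخت [MASK] است."):
    print(prediction["token_str"], round(prediction["score"], 3))
```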

Paper: https://doi.org/10.1007/s11063-021-10528-4


YouTube Demo

References

  1. Mehrdad Farahani, Mohammad Gharachorloo, Marzieh Farahani, and Mohammad Manthouri. ParsBERT: Transformer-based Model for Persian Language Understanding. Neural Processing Letters, 2021. https://doi.org/10.1007/s11063-021-10528-4