Mukhammadsaid Mamasaidov

I'm a 23-years old research scientist, working mostly with natural language processing. I'm a founder of Tahrirchi, an R&D company that develops language solutions for Uzbek, such as spell and grammar checking. I'm also interested in other low-resource Turkic languages, like Kazakh and Kyrgyz.

At Tahrirchi I've implemented morphological analyzers for Uzbek and Kazakh, pre-trained large language models for Uzbek, and created a grammar checker for Uzbek, the first of its kind among Turkic languages, by adapting Grammarly's model for Uzbek. I've led a team that won 50k$ at mGovAward 2022, "Best Research Project for the Uzbek Language 2021", and was nominated for "Builder of the Future" of Uzbekistan medal.

We also contribute to the open-source community of Uzbek NLP. Recently, we have open-sourced the biggest corpus of Uzbek (36 GB) consisting of more than 35,000 books. Additionally, we also open-sourced several pre-trained language models for Uzbek.

Email / CV / Scholar / LinkedIn / Github

Research

I'm interested in nature language processing and machine learning. Specifically, I focus on Spelling Correction and Grammatical Error Correction (GEC) tasks.

	Grammatical Error Correction (GEC) for Agglutinative Languages: Case of Uzbek Mukhammadsaid Mamasaidov In draft, 2023 project page (in Uzbek) / demo Using Grammarly's GECToR model, we created a grammar checker for Uzbek. The idea is to incorporate the knowledge of rule-based morphological analyzers into the neural network, which makes generalization much greater for agglutinative languages, like Uzbek.
	Tahrirgoh: Data Annotation Platform for Grammatical Error Correction Mukhammadsaid Mamasaidov, Jasur Yusupov 2023 code We created a minimalistic data annotation platform for grammatical error correction data collection.
	UzBooks and UzCrawl: the biggest open-sourced Uzbek corpora Mukhammadsaid Mamasaidov, Abror Shopulatov 2023 UzBooks / UzCrawl We scanned and OCRed more than 35,000 high-quality books in Uzbek. Overall, the dataset size is 33 GB, plus 3 GB of crawled data, like news and articles. It's the biggest high-quality dataset in Uzbek to this date.
	A two-level morphological analyzer for Uzbek Mukhammadsaid Mamasaidov In draft, 2022 video Using finite-state transducers I implemented a morphological analyzer for Uzbek. Then, I formed a team and created a soft mobile keyboard only for Uzbek, with its agglutinative nature in mind. We won 50k$ at mGovAward 2022 as the best mobile application for the government.
	UZWORDNET: A Lexical-Semantic Database for the Uzbek Language Alessandro Agostini, Timur Usmanov, Ulugbek Khamdamov, Nilufar Abdurakhmonova Mukhammadsaid Mamasaidov GWC, 2021 project page / ACL We describe the initial development of a “word-net” for the Uzbek language compatible to Princeton WordNet. To the best of our knowledge, it is the largest wordnet for Uzbek existing to date, and the second wordnet developed overall.

Feel free to steal this website's source code.