Hong Kong is a multilingual society, but nearly 90% of the population speak Cantonese as a first language. Cantonese is used in both formal and informal settings. Many non-local people living and working in Hong Kong therefore need to learn Cantonese in order to integrate themselves into the local community.
Despite its dominant status, however, Cantonese has never been formalised as part of the school curriculum. Consequently, learning and teaching materials and teaching methods vary considerably.
Dr Andy Chin Chi-on, Head and Associate Professor at the Department of Linguistics and Modern Language Studies, The Education University of Hong Kong (EdUHK), proposed a research programme that adopts a more scientific and objective approach to promoting the learning and teaching of Cantonese.
Studies in the past five decades have enriched our understanding of the lexicon, phonology and grammar of Cantonese; yet some deeper issues, such as pragmatics, semantics and discourse, remain to be explored. This kind of research requires a significant amount of authentic and natural language data. The research team thus proposed the construction of a Cantonese corpus to expand the scope of Cantonese linguistic research.
One major advantage of using a corpus in language studies is that it provides objective, unbiased quantitative and qualitative data for research and other applications, including the compilation of language materials and natural language processing tasks such as speech-to-text and text-to-speech.
The research project started in 2011 with the support of an EdUHK internal research grant and the Early Career Scheme of the Research Grants Council. Dr Chin constructed the corpus in two phases; it contains about one million Chinese characters. The corpus data was collected by transcribing the dialogues of 80 black-and-white movies produced between the 1950s and 1970s, and is now available online.
The corpus won the Gold Medal and Special Award at the Silicon Valley International Invention Festival in 2019. Dr Chin has also developed mobile apps containing the corpus data.
The CanPro app, which enables learners to practise Cantonese pronunciation through commonly used expressions in the corpus, won a Silver Medal at the 2021 Inventions Geneva Evaluation Days.
Another mobile app called ‘Learn Cantonese with Big Data’, supported by the Language Fund of the Standing Committee on Language Education and Research, was launched in March 2022. One major feature of this app is the provision of linguistic information that Cantonese learners might find relevant and useful, such as collocations in verb-noun and classifier-noun structures, which cannot be obtained without corpus data.
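To give a sense of how collocation information of this kind can be drawn from corpus data, the following is a minimal sketch in Python. It is not the method used by the research team or the app. The five sentences are invented for illustration, the part-of-speech labels (such as "CL" for classifier and "N" for noun) are assumed rather than taken from the actual corpus, and the function name count_adjacent_pairs is hypothetical. The sketch simply counts how often a word with one tag is immediately followed by a word with another.

```python
from collections import Counter

# Toy data invented for illustration; NOT drawn from the actual corpus.
# Each sentence is a list of (word, tag) pairs. The tag labels are assumptions.
SENTENCES = [
    [("我", "PRON"), ("有", "V"), ("一", "NUM"), ("隻", "CL"), ("狗", "N")],
    [("佢", "PRON"), ("買", "V"), ("咗", "ASP"), ("兩", "NUM"), ("本", "CL"), ("書", "N")],
    [("嗰", "DET"), ("隻", "CL"), ("貓", "N"), ("好", "ADV"), ("得意", "ADJ")],
    [("我", "PRON"), ("睇", "V"), ("咗", "ASP"), ("三", "NUM"), ("本", "CL"), ("書", "N")],
    [("佢", "PRON"), ("有", "V"), ("一", "NUM"), ("架", "CL"), ("車", "N")],
]


def count_adjacent_pairs(sentences, first_tag, second_tag):
    """Count word pairs where a first_tag word is directly followed by a second_tag word."""
    counts = Counter()
    for sentence in sentences:
        # Walk over each pair of neighbouring tokens in the sentence.
        for (word1, tag1), (word2, tag2) in zip(sentence, sentence[1:]):
            if tag1 == first_tag and tag2 == second_tag:
                counts[(word1, word2)] += 1
    return counts


# Classifier-noun collocations, most frequent first.
classifier_noun = count_adjacent_pairs(SENTENCES, "CL", "N")
for (classifier, noun), freq in classifier_noun.most_common():
    print(classifier, noun, freq)
```

In this toy sample the pair 本 + 書 occurs twice and every other classifier-noun pair once. Verb-noun collocations would need a wider window than adjacent words, since aspect markers, numerals and classifiers often come between a verb and its object; with a corpus of about one million characters, frequency counts of this sort become large enough to show learners which combinations are actually in common use.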