The China Cybersecurity Association has officially launched a Chinese internet corpus resource platform that gives users access to a wide range of Chinese internet corpus resources. The platform tags each resource by industry sector, content modality, and dataset size, so users can locate and download the corpora that fit their needs.
The released corpus resources were developed by the China Cybersecurity Association in collaboration with the National Internet Emergency Center, under the guidance of the Cyberspace Administration of China. Building on the previously released 1.0 version of the Chinese internet foundational corpus, this update adds new high-quality, trustworthy data gathered through a collaborative sharing mechanism. The data went through source selection, content filtering, and deduplication to produce the 2.0 version of the Chinese internet foundational corpus, which totals 120 GB and contains 38 million entries.
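The announcement does not describe how source selection, content filtering, and deduplication were implemented. As a rough illustration only, the three steps might be sketched as follows; the function names, the record layout (`source` and `text` fields), the minimum-length filter, and the hash-based exact deduplication are all assumptions made for this example, not details of the actual pipeline.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """NFKC-normalize and collapse whitespace so trivially different copies match."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def build_corpus(records, trusted_sources, min_chars=20):
    """Hypothetical three-step cleanup: source selection, filtering, deduplication."""
    seen = set()  # SHA-256 digests of texts already kept
    kept = []
    for rec in records:
        # 1. Source selection: keep only records from an allow-list of sources.
        if rec["source"] not in trusted_sources:
            continue
        # 2. Content filtering: here just a minimum length after normalization.
        text = normalize(rec["text"])
        if len(text) < min_chars:
            continue
        # 3. Deduplication: drop exact duplicates of the normalized text.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append({"source": rec["source"], "text": text})
    return kept
```

A production pipeline would likely use richer quality filters and near-duplicate detection rather than exact hashing; this sketch only shows the order of the three steps named above.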
In addition, the platform hosts 27 corpus datasets amounting to approximately 2.7 TB. These datasets fall into three categories: first, foundational Chinese internet corpora jointly built by the China Cybersecurity Association and the National Internet Emergency Center; second, shared internet corpora from organizations including People's Daily Online, the Beijing Academy of Artificial Intelligence, and Shanghai AI Laboratory; third, high-quality Chinese foundational corpus samples provided by institutions such as the China Network Space Research Institute, the National Version Library of China, the Encyclopaedia of China Publishing House, and the Library of the Chinese Academy of Social Sciences.
Users can visit the official website of the China Cybersecurity Association, click on the link for the "Chinese Internet Corpus Resource Platform," complete registration and authentication procedures, and then download the required corpus resources.
According to the announcement, the Artificial Intelligence Security Governance Committee of the China Cybersecurity Association will continue to develop the Chinese internet foundational corpora in support of AI technology innovation and industrial development. The launch marks a step forward in the sharing and use of Chinese internet corpus resources.