AI-ML·Importance 6·2026. 05. 18.·r/MachineLearning

Released a free 9.8M doc Indic multilingual corpus — Hindi, Bengali, Tamil, Telugu + 7 more (CC0, HuggingFace) [P]

── EN ──────────────────

A free Indic multilingual corpus of 9.8M documents has been released.

A collection of 9.8 million web documents has been released for free as part of a multilingual research project. The dataset covers 11 languages, including Hindi, Bengali, Tamil, and Telugu, and totals approximately 8.4 billion tokens. It is available on HuggingFace under the CC0 license, so it can be used for research and development without licensing restrictions.
