AI-ML·Importance 6·2026. 05. 17.·r/MachineLearning

How are you handling training data when public datasets don't match your use case? [D]

── KO ──────────────────

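A discussion of how to handle training data when public datasets do not fit the use case.

Public datasets are often too generic or from the wrong domain. Ways around this include accepting degraded performance with the existing dataset, collecting data by scraping and cleaning it, or applying augmentation techniques such as SMOTE. The author proposes sourcing real data, curating it to a specific schema, and supplementing it with synthetic data. The post asks readers for their views on the difficulties of data handling and how they solve them.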

── EN ──────────────────

Discussion on handling training data when public datasets are inadequate.

Public datasets are often too generic, from the wrong domain, or too small. The options discussed include accepting degraded performance with existing datasets, spending time scraping and cleaning data, or using augmentation techniques such as SMOTE. The author proposes a different approach: sourcing permissively licensed data, curating it to a specific schema, and generating synthetic data to fill coverage gaps. The post asks the community how they have handled similar data challenges.
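
For readers unfamiliar with SMOTE, the core idea can be sketched in a few lines: a synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours. The sketch below is a simplified illustration in plain Python, not the implementation from any particular library and not the author's pipeline; the function name `smote_like`, its parameters, and the sample points are all hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours.

    Hypothetical helper for illustration only; real SMOTE implementations
    also handle class ratios, categorical features, and edge cases.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority-class neighbours of the base point, excluding itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Made-up minority-class feature vectors (two features each)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
```

Because every synthetic point lies between two existing samples, this kind of augmentation only fills in the region the minority class already covers. That limitation is one reason the thread treats it as a different option from curating and synthesizing new data for coverage.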
