AI-ML·중요도 5·2026. 05. 21.·r/MachineLearning

Do VLMs in production still use fixed-patch ViTs for their vision capabilities? [D]

── KO ──────────────────

비전 모델에서 고정 패치 ViT 사용에 대한 의문 제기.

연구 커뮤니티는 비전을 위한 더 효율적이고 효과적인 토크나이저를 제공해왔습니다. 하지만 주요 모델들이 비고정 패치 토크나이제를 적용하고 있는지는 불확실합니다. 글쓴이는 마진 이익, 효율성을 위한 고정 토큰 수 요구, 입력 적응형 패칭의 이해 부족 등을 이유로 가능성을 의문시하고 있습니다. 이와 관련해 대형 플레이어들이 실제로 동적 토크나이제를 사용하고 있는지에 대한 질문을 던지고 있습니다.

── EN ──────────────────

Questioning the use of fixed-patch ViTs in vision models.

The research community has provided more efficient and effective tokenization methods for vision. However, there is uncertainty about whether major models are adopting non-fixed-patch tokenization. The author raises concerns about marginal gains, the need for a fixed number of tokens for efficiency, and a lack of understanding of scaling laws for adaptive patching. This leads to the question of whether big players are actually utilizing dynamic tokenization in practice.

원문 보기 →목록으로