#benchmark

AI-curated articles

7·ai-ml·Analysis·r/MachineLearning·2026. 05. 25.

The famous METR AI time horizons graph contains numerous severe errors [D]

A critique arguing that the METR AI time horizons graph contains serious errors that undermine the reliability of its data.

#metr #benchmark #data #ai #research

6·ai-ml·Other·r/MachineLearning·2026. 05. 22.

One thing that's been bothering me lately: benchmark performance often tells me almost nothing about whether a workflow will survive production usage. [D]

Argues that benchmark performance often fails to predict whether a workflow will survive production use.

#benchmark #performance #user intent #workflow #evaluation

8·ai-ml·Release·GeekNews·2026. 05. 19.

Gemini 3.5 Flash

Gemini 3.5 Flash is a new AI model with frontier-level intelligence.

#gemini #ai #coding #performance #benchmark

6·ai-ml·Other·GeekNews·2026. 05. 18.

whichllm - Find the best-performing local LLM that actually runs on your hardware

whichllm is a CLI tool that recommends the best-fitting local LLMs for the user's hardware.

#huggingface #llm #benchmark #cli #nvidia

7·ai-ml·Analysis·r/MachineLearning·2026. 05. 17.

#1 on memory benchmark LongMemEval with Gemini Flash, not Pro [R]

Reports a #1 result on the LongMemEval memory benchmark using Gemini Flash rather than Pro.

#gemini #memory #retrieval #benchmark #temporal salience

6·ai-ml·Other·r/MachineLearning·2026. 05. 12.

Cache-testing software for LLM-provider-style tiered ephemeral caches? [D]

Asks for cache-testing software for the tiered ephemeral caches used by LLM providers.

#libcachesim #llm #caching #benchmark #simulation

6·ai-ml·Analysis·GeekNews·2026. 05. 10.

LLMs Corrupt Documents When You Delegate

Introduces DELEGATE-52, a benchmark that evaluates LLM document fidelity in delegated editing workflows.

#llm #benchmark #editing #delegation #document

7·ai-ml·Analysis·r/MachineLearning·2026. 05. 09.

LLM rankings are not a ladder: experimental results from a transitive benchmark graph [D]

Discusses experimental results from a directed benchmark graph built to evaluate LLM performance.

#llm #benchmark #graph #intelligence index #model evaluation

7·ai-ml·Analysis·OpenAI Blog·2026. 02. 26.

Pacific Northwest National Laboratory and OpenAI partner to accelerate federal permitting

OpenAI and PNNL introduce a benchmark for using AI to accelerate federal permitting.

#openai #benchmark #nepa #ai #infrastructure

8·ai-ml·Release·OpenAI Blog·2025. 11. 03.

Introducing IndQA

OpenAI introduces IndQA, a benchmark for evaluating AI systems in Indian languages.

#openai #indqa #ai #benchmark #cultural understanding

7·ai-ml·Other·OpenAI Blog·2025. 05. 12.

Introducing HealthBench

HealthBench is a new benchmark for evaluating AI in healthcare.

#ai #healthcare #benchmark #model #performance

8·ai-ml·Release·OpenAI Blog·2023. 03. 14.

GPT-4

OpenAI's GPT-4 is a large multimodal model that accepts image and text inputs.

#gpt-4 #openai #deep learning #multimodal #benchmark
