NeMo Multi-Scale Diarization Decoder

2025. 8. 12. 11:26 · Research, Knowledge

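Speaker diarization strikes me as a really hard task. Even after finding a decent model and fine-tuning it, the results depend far too much on audio quality and domain. And if you feed the output to an LLM to generate meeting minutes, a single speaker-attribution mistake can ruin the whole thing.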

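Then I came across a rather interesting model: the speaker diarization model in the NeMo Framework. It goes by the name 'Multi-Scale Diarization Decoder', and it offers strong performance according to the paper, pretrained models for different domains, and convenient training. Like many others, I had been using a fine-tuned Pyannote until now, but the results here are surprising.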

Papers

https://arxiv.org/abs/2203.15974

Multi-scale Speaker Diarization with Dynamic Scale Weighting


https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/speaker_diarization/models.html

Models — NVIDIA NeMo Framework User Guide


As the paper shows, performance is excellent. Embeddings are extracted internally with NVIDIA's own TitaNet embedding model. Since TitaNet itself performs well, that probably helps the diarization results too.

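The DER values on two domains stay in the 1% range. Seeing DER figures in the 1% range is quite rare. Perhaps because of this performance, there is also, interestingly, a paper on how it performs in classrooms, a very hard environment for diarization.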
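For readers unfamiliar with the metric, DER is conventionally the sum of false-alarm, missed-speech, and speaker-confusion time divided by the total reference speech time. Below is a minimal sketch of that arithmetic. The function name and the numbers are made up for illustration and are not taken from the paper.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """DER = (false alarm + missed speech + speaker confusion) / total reference speech.

    All arguments are durations in seconds.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed + confusion) / total_speech


# Hypothetical durations for a 10-minute recording (illustration only)
der = diarization_error_rate(false_alarm=2.0, missed=3.0, confusion=1.0, total_speech=600.0)
print(f"DER = {der:.2%}")
```

With these hypothetical numbers, 6 seconds of error over 600 seconds of speech gives a DER of 1%, which shows how little room for mistakes a 1% DER leaves.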

https://arxiv.org/abs/2505.10879

Multi-Stage Speaker Diarization for Noisy Classrooms


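Classrooms come with all sorts of pain points: overlapping speech, background noise and reverberation, and children's utterances that are hard to detect. It looks like the authors addressed these with denoising and other techniques. As someone who works on speech AI, I should read it properly and write a real review.

This paper also compares the NeMo diarization model against Pyannote.

It appears to outperform Pyannote. I should test it myself.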

Code

import json
import os

import wget
from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer
from nemo.collections.asr.parts.utils.decoder_timestamps_utils import ASRDecoderTimeStamps
from nemo.collections.asr.parts.utils.diarization_utils import OfflineDiarWithASR


def speaker_diarization(domain_type, audio_path, output_dir='path/to/output', rttm_path=None, evaluate=False, config_path='/path/to/config'):
    # Evaluation uses oracle VAD, which needs the reference RTTM.
    if evaluate and rttm_path is None:
        raise ValueError("rttm_path must be provided when evaluate=True.")

    # domain_type selects the inference config, e.g. 'meeting', 'telephonic', or 'general'.
    CONFIG_FILE_NAME = f"diar_infer_{domain_type}.yaml"
    os.makedirs(output_dir, exist_ok=True)
    os.makedirs(config_path, exist_ok=True)

    CONFIG_URL = f"https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/speaker_tasks/diarization/conf/inference/{CONFIG_FILE_NAME}"

    # Download the config only if it is not cached locally.
    if not os.path.exists(os.path.join(config_path, CONFIG_FILE_NAME)):
        CONFIG = wget.download(CONFIG_URL, config_path)
    else:
        CONFIG = os.path.join(config_path, CONFIG_FILE_NAME)

    cfg = OmegaConf.load(CONFIG)

    # One manifest entry describing the input audio.
    meta = {
        'audio_filepath': audio_path,
        'offset': 0,
        'duration': None,
        'label': 'infer',
        'text': '-',
        'num_speakers': None,
        'rttm_filepath': rttm_path,  # reference RTTM is required when oracle_vad is True
        'uem_filepath': None
    }

    manifest_path = os.path.join(output_dir, 'input_manifest.json')
    with open(manifest_path, 'w') as fp:
        json.dump(meta, fp)
        fp.write('\n')

    cfg.diarizer.manifest_filepath = manifest_path
    cfg.diarizer.speaker_embeddings.model_path = 'titanet_large'
    cfg.diarizer.out_dir = output_dir
    cfg.diarizer.clustering.parameters.oracle_num_speakers = False

    if not evaluate:
        # Inference: run a VAD model and ASR, then diarize with word timestamps.
        cfg.diarizer.vad.model_path = 'vad_multilingual_marblenet'
        cfg.diarizer.asr.model_path = 'stt_en_conformer_ctc_large'
        cfg.diarizer.oracle_vad = False
        cfg.diarizer.asr.parameters.asr_based_vad = False

        asr_decoder_ts = ASRDecoderTimeStamps(cfg.diarizer)
        asr_model = asr_decoder_ts.set_asr_model()
        word_hyp, word_ts_hyp = asr_decoder_ts.run_ASR(asr_model)
        asr_diar_offline = OfflineDiarWithASR(cfg.diarizer)
        asr_diar_offline.word_ts_anchor_offset = asr_decoder_ts.word_ts_anchor_offset

        diar_hyp, diar_score = asr_diar_offline.run_diarization(cfg, word_ts_hyp)

    else:
        # Evaluation: multi-scale clustering diarizer with oracle VAD (speech segments from the RTTM).
        cfg.diarizer.speaker_embeddings.parameters.window_length_in_sec = [1.5, 1.25, 1.0, 0.75, 0.5]
        cfg.diarizer.speaker_embeddings.parameters.shift_length_in_sec = [0.75, 0.625, 0.5, 0.375, 0.1]
        cfg.diarizer.speaker_embeddings.parameters.multiscale_weights = [1, 1, 1, 1, 1]
        cfg.diarizer.oracle_vad = True

        oracle_vad_clusdiar_model = ClusteringDiarizer(cfg=cfg)

        oracle_vad_clusdiar_model.diarize()

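The code looks considerably more complicated than a Pyannote pipeline, but there is not much to it. Most of it is loading the config, and the rest is there because I added an evaluate mode. In evaluate mode it uses oracle VAD, meaning the speech segments come from the reference RTTM instead of a VAD model.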
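To make the multi-scale lists in the evaluate branch more concrete, here is a simplified sketch of how each (window, shift) pair cuts the same audio into a different number of segments. This is only an illustration, not NeMo's actual segmentation code. `uniform_segments` is a hypothetical helper and the 3-second duration is made up.

```python
def uniform_segments(duration, window, shift):
    """Return (start, end) pairs of sliding windows covering [0, duration]."""
    segments = []
    i = 0
    while True:
        start = i * shift  # index-based to avoid accumulating float error
        if start >= duration:
            break
        end = min(start + window, duration)
        segments.append((round(start, 3), round(end, 3)))
        if end >= duration:
            break
        i += 1
    return segments


# Same values as window_length_in_sec / shift_length_in_sec in the code above
windows = [1.5, 1.25, 1.0, 0.75, 0.5]
shifts = [0.75, 0.625, 0.5, 0.375, 0.1]

counts = [len(uniform_segments(3.0, w, s)) for w, s in zip(windows, shifts)]
for w, s, n in zip(windows, shifts, counts):
    print(f"window={w}s shift={s}s -> {n} segments")
```

The longer windows give fewer but more reliable speaker embeddings, while the shorter ones give finer time resolution. This is the trade-off the multi-scale approach is meant to balance.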

Performance

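DER figures are in fact already published in many places. The papers report them, and the comparison has already been done on the PapersWithCode Speaker Diarization task as well. There is little point in my repeating it, so I will take a different angle. Instead of plain DER, I count the unique speakers in the final output and check how well each system estimates the number of speakers. The results below come from a comparison on my own private dataset. For Pyannote, I compare both the pretrained pipeline and a model I fine-tuned myself.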

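Yellow means the speaker count was estimated correctly, green marks a result that was off by one speaker in cases where all of them failed to estimate the count, and red means complete failure. A plain table is hard to take in at a glance, so let's look at pie charts.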
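The counting step can be sketched roughly as follows, assuming the predictions are standard RTTM files in which the speaker label is the eighth field. The function names, bucket names, and sample RTTM are hypothetical. The grading is also simplified: every off-by-one result gets its own bucket regardless of how the other systems did.

```python
def count_speakers(rttm_text):
    """Count unique speaker labels in RTTM-formatted text."""
    speakers = set()
    for line in rttm_text.splitlines():
        fields = line.split()
        # RTTM: SPEAKER <file> <chan> <onset> <dur> <NA> <NA> <speaker> <NA> <NA>
        if len(fields) >= 8 and fields[0] == "SPEAKER":
            speakers.add(fields[7])
    return len(speakers)


def grade(predicted, actual):
    """Bucket a speaker-count estimate: exact, off by one, or fail."""
    diff = abs(predicted - actual)
    if diff == 0:
        return "exact"
    if diff == 1:
        return "off_by_one"
    return "fail"


# Made-up RTTM content for illustration
SAMPLE_RTTM = """\
SPEAKER meeting1 1 0.00 3.20 <NA> <NA> speaker_0 <NA> <NA>
SPEAKER meeting1 1 3.20 2.10 <NA> <NA> speaker_1 <NA> <NA>
SPEAKER meeting1 1 5.30 1.40 <NA> <NA> speaker_0 <NA> <NA>
"""

predicted = count_speakers(SAMPLE_RTTM)
print(predicted, grade(predicted, actual=3))
```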

Whether Pyannote is fine-tuned or not, NeMo diarization came out overwhelmingly ahead on speaker-count estimation as well. The conclusion, summarized as a table, is as follows.

Conclusion

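Beyond this, the NeMo Framework releases a variety of speech AI models for free and lets you train them as well. It also publishes high-quality papers, so you can see how the results were reached. Living off these gifts from NVIDIA feels like a blessing. I should buy more of the stock.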
