KoSimCSE

: Simple Contrastive Learning of Korean Sentence Embeddings Implementation

A model obtained by fine-tuning SimCSE on Korean sentences.

It is reportedly SKT's KoBERT model trained on Kakao Brain's KorNLU datasets, in the supervised setting only.

- GitHub link: https://github.com/BM-K/KoSimCSE-SKT

Model
- SKT KoBERT
Dataset
- kakaobrain NLU dataset
    - train: KorNLI
    - dev & test: KorSTS
Setting
    * epochs: 3
    * dropout: 0.1
    * batch size: 256
    * temperature: 0.05
    * learning rate: 1e-4
    * warm-up ratio: 0.05
    * max sequence length: 50
    * evaluation steps during training: 250
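
As a rough sketch, the settings above can be collected into a plain dictionary, and the number of warm-up steps follows once the training set size is known. The dictionary keys, the `schedule_steps` helper, and the example count of 256,000 training pairs are illustrative assumptions, not names or figures from the repository. The post also does not say how the repository handles the last partial batch, so rounding up is an assumption too.

```python
import math

# Hyperparameters copied from the "Setting" list above; the key names are made up.
SETTINGS = {
    "epochs": 3,
    "dropout": 0.1,
    "batch_size": 256,
    "temperature": 0.05,
    "learning_rate": 1e-4,
    "warmup_ratio": 0.05,
    "max_seq_len": 50,
    "eval_steps": 250,
}


def schedule_steps(num_examples, settings=SETTINGS):
    """Return (total optimizer steps, warm-up steps) for a given dataset size."""
    # Assumption: the last partial batch is kept, hence the ceiling.
    steps_per_epoch = math.ceil(num_examples / settings["batch_size"])
    total_steps = steps_per_epoch * settings["epochs"]
    warmup_steps = int(total_steps * settings["warmup_ratio"])
    return total_steps, warmup_steps


# Hypothetical dataset size, chosen only to keep the arithmetic easy to follow.
total, warmup = schedule_steps(256_000)
print(total, warmup)
```

With these assumed numbers, one epoch is 1,000 steps, so evaluation every 250 steps would run four times per epoch.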

Pre-Trained Models

Uses BERT's pooled [CLS] token representation.
Using only the unpooled [CLS] token representation might work better.
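
To make the difference concrete: in BERT, the pooled output is the first-token ([CLS]) hidden state passed through an extra dense layer with tanh, while the unpooled variant is that hidden state taken directly. Below is a minimal NumPy sketch with random stand-in weights. The function names and tensor shapes are made up for illustration, and this is not the repository's code.

```python
import numpy as np


def unpooled_cls(last_hidden_state):
    """Take the hidden state of the first ([CLS]) token: (batch, seq, dim) -> (batch, dim)."""
    return last_hidden_state[:, 0, :]


def pooled_cls(last_hidden_state, weight, bias):
    """Apply a BERT-style pooler (dense layer + tanh) on top of the [CLS] hidden state."""
    return np.tanh(unpooled_cls(last_hidden_state) @ weight.T + bias)


rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 5, 8))   # toy batch: 2 sentences, 5 tokens, 8 dimensions
weight = rng.normal(size=(8, 8))      # random stand-in for the pooler weights
bias = np.zeros(8)

raw = unpooled_cls(hidden)
pooled = pooled_cls(hidden, weight, bias)
print(raw.shape, pooled.shape)
```

The tanh squashes every pooled value into [-1, 1], which is one reason the two representations can behave differently as sentence embeddings.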

Downstream Task - 1) Semantic Search

import numpy as np
from model.utils import pytorch_cos_sim
from data.dataloader import convert_to_tensor, example_model_setting


def main():
    model_ckpt = './output/nli_checkpoint.pt'
    model, transform, device = example_model_setting(model_ckpt)

    # Corpus with example sentences
    corpus = ['한 남자가 음식을 먹는다.',
              '한 남자가 빵 한 조각을 먹는다.',
              '그 여자가 아이를 돌본다.',
              '한 남자가 말을 탄다.',
              '한 여자가 바이올린을 연주한다.',
              '두 남자가 수레를 숲 속으로 밀었다.',
              '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.',
              '원숭이 한 마리가 드럼을 연주한다.',
              '치타 한 마리가 먹이 뒤에서 달리고 있다.']

    inputs_corpus = convert_to_tensor(corpus, transform)

    corpus_embeddings = model.encode(inputs_corpus, device)

    # Query sentences:
    queries = ['한 남자가 파스타를 먹는다.',
               '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.',
               '치타가 들판을 가로 질러 먹이를 쫓는다.']

    # Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
    top_k = 5
    for query in queries:
        query_embedding = model.encode(convert_to_tensor([query], transform), device)
        cos_scores = pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
        cos_scores = cos_scores.cpu().detach().numpy()

        # Indices of the top_k highest scores, ordered from most to least similar
        top_results = np.argpartition(-cos_scores, range(top_k))[0:top_k]

        print("\n\n======================\n\n")
        print("Query:", query)
        print("\nTop %d most similar sentences in corpus:" % top_k)

        for idx in top_results:
            print(corpus[idx].strip(), "(Score: %.4f)" % (cos_scores[idx]))


if __name__ == '__main__':
    main()
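
The script above depends on the repository's modules and a trained checkpoint. The retrieval step on its own can be sketched with plain NumPy: normalize the embeddings, take dot products, and sort. The tiny 2-dimensional vectors and the helper names `cos_sim` and `top_k_indices` are invented for illustration and stand in for real sentence embeddings.

```python
import numpy as np


def cos_sim(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def top_k_indices(scores, k):
    """Indices of the k largest scores, best first (k is capped at len(scores))."""
    k = min(k, len(scores))
    return np.argsort(-scores)[:k]


# Toy "embeddings" for four corpus sentences and one query.
corpus_emb = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]])
query_emb = np.array([[0.9, 0.1]])

scores = cos_sim(query_emb, corpus_emb)[0]
best = top_k_indices(scores, 2)
for idx in best:
    print(idx, "(Score: %.4f)" % scores[idx])
```

Capping `k` at the corpus size avoids an error when the corpus is smaller than the requested number of results.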

SimCSE: Simple Contrastive Learning of Sentence Embeddings

- [EMNLP 2021]

- Paper link: https://arxiv.org/abs/2104.08821

Abstract

What is SimCSE?

In short, it is a simple contrastive learning framework that advances state-of-the-art sentence embeddings.

The paper proposes two approaches: unsupervised and supervised.

Figure 1: (a) Unsupervised SimCSE predicts the input sentence itself from in-batch negatives, with different hidden dropout masks applied. (b) Supervised SimCSE leverages the NLI datasets and takes the entailment (premise-hypothesis) pairs as positives, and contradiction pairs as well as other in-batch instances as negatives.

1) unsupervised approach

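First, the paper describes an unsupervised approach that takes an input sentence and predicts itself in a contrastive objective, using only standard dropout as noise.

This simple method works surprisingly well, performing on par with previous supervised counterparts.

The authors report that dropout acts as minimal data augmentation, and that removing it leads to representation collapse.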
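
The idea can be sketched in a few lines of NumPy: encode the same batch twice with different dropout masks, then treat the two views of each sentence as a positive pair and every other sentence in the batch as a negative. This is a toy sketch under loud assumptions: the "encoder" here is a single random linear layer with dropout on its input (the paper applies dropout inside a Transformer), and all function names are hypothetical. The temperature of 0.05 matches the setting listed earlier in this post.

```python
import numpy as np


def normalize(x, eps=1e-12):
    """L2-normalize each row so that dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def encode_with_dropout(x, weight, rate, rng):
    """Toy encoder (assumption): inverted dropout on the input, then a linear map."""
    keep = rng.random(x.shape) >= rate
    return (x * keep / (1.0 - rate)) @ weight


def in_batch_contrastive_loss(view_a, view_b, temperature=0.05):
    """Cross-entropy over cosine similarities; row i's positive is column i."""
    sim = normalize(view_a) @ normalize(view_b).T / temperature
    sim = sim - sim.max(axis=1, keepdims=True)          # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))


rng = np.random.default_rng(0)
sentences = rng.normal(size=(8, 32))   # stand-in features for 8 sentences
weight = rng.normal(size=(32, 16))     # random stand-in encoder weights

# Same input, two independent dropout masks -> two slightly different views.
view_a = encode_with_dropout(sentences, weight, 0.1, rng)
view_b = encode_with_dropout(sentences, weight, 0.1, rng)
loss = in_batch_contrastive_loss(view_a, view_b)
print("loss:", loss)
```

The loss is lowest when each sentence is much closer to its own second view than to any other sentence in the batch.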

2) supervised approach

The paper then proposes a supervised approach that incorporates annotated pairs from natural language inference datasets into the contrastive learning framework, using "entailment" pairs as positives and "contradiction" pairs as hard negatives.
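
A hedged sketch of how such triplets might enter the loss: for each premise, the entailment hypothesis is the positive, and the candidates it competes against are all other entailment hypotheses in the batch plus all contradiction hypotheses. The function names and the toy one-hot embeddings are invented for illustration. This shows the general shape of the objective, not the authors' exact code.

```python
import numpy as np


def normalize(x, eps=1e-12):
    """L2-normalize each row so that dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def supervised_contrastive_loss(anchor, positive, hard_negative, temperature=0.05):
    """Row i's positive is positive[i]; every other column acts as a negative."""
    a, p, n = normalize(anchor), normalize(positive), normalize(hard_negative)
    # Columns 0..N-1: entailment hypotheses, columns N..2N-1: contradiction hypotheses.
    logits = np.concatenate([a @ p.T, a @ n.T], axis=1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(a))
    return -np.mean(log_prob[idx, idx])


# Toy embeddings: three premises on orthogonal axes.
premise = np.eye(3)
entailment = np.eye(3)        # identical to the premise -> easy positives
contradiction = -np.eye(3)    # opposite direction -> easy negatives

good = supervised_contrastive_loss(premise, entailment, contradiction)
bad = supervised_contrastive_loss(premise, contradiction, entailment)
print(good, bad)
```

Swapping the roles of the entailment and contradiction embeddings makes the loss jump, which is the signal that pushes contradictions away from the premise.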

Evaluated on standard semantic textual similarity (STS) tasks, the unsupervised and supervised models using BERT base achieve an average Spearman's correlation of 76.3% and 81.6% respectively, a 4.2% and 2.2% improvement over the previous best results.
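
For readers unfamiliar with the metric: STS evaluation compares the model's cosine similarity for each sentence pair against a human-annotated similarity score, using Spearman's rank correlation. A minimal sketch follows, assuming no tied values (a full implementation would average tied ranks). The helper names and the example numbers are made up and are not results from the paper.

```python
import numpy as np


def ranks(values):
    """Rank values from 1..n (assumes no ties, unlike a full implementation)."""
    order = np.argsort(values)
    result = np.empty(len(values))
    result[order] = np.arange(1, len(values) + 1)
    return result


def spearman(x, y):
    """Spearman's correlation = Pearson correlation of the ranks."""
    return np.corrcoef(ranks(np.asarray(x)), ranks(np.asarray(y)))[0, 1]


# Made-up example: model cosine similarities vs. gold similarity scores.
predicted = [0.9, 0.2, 0.5, 0.7]
gold = [4.8, 0.5, 2.0, 3.9]
print(spearman(predicted, gold))
```

Because only the ordering matters, the model's scores do not need to be on the same scale as the gold annotations.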

The authors also show, both theoretically and empirically, that the contrastive learning objective regularizes the anisotropic space of pre-trained embeddings to be more uniform, and that it better aligns positive pairs when supervised signals are available.
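
The two properties can be measured with the alignment and uniformity metrics of Wang and Isola (2020), on which the SimCSE analysis builds. Alignment is the mean squared distance between normalized positive pairs. Uniformity is the log of the mean Gaussian potential over all pairs of normalized embeddings. Lower is better for both. A small NumPy sketch with invented function names and toy data:

```python
import numpy as np


def normalize(x, eps=1e-12):
    """Project each embedding onto the unit sphere."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def alignment(x, x_pos):
    """Mean squared distance between normalized positive pairs (lower = better aligned)."""
    return np.mean(np.sum((normalize(x) - normalize(x_pos)) ** 2, axis=1))


def uniformity(x, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs (lower = more uniform)."""
    z = normalize(x)
    sq_dist = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(len(z), k=1)   # each unordered pair once
    return np.log(np.mean(np.exp(-t * sq_dist[upper])))


spread = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
collapsed = np.ones((4, 2))   # every sentence mapped to the same point

print(alignment(spread, spread))
print(uniformity(spread), uniformity(collapsed))
```

A collapsed representation has perfect alignment but the worst possible uniformity, which is why the two metrics are read together.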

6.3 Ablation Studies

Qualitative comparison

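The experiment compares SBERT-base and SimCSE-BERT-base. It uses 150k captions from the Flickr30k dataset and picks random sentences as queries to retrieve similar sentences. As the examples in Table 8 show, the sentences retrieved by SimCSE are of higher quality than those retrieved by SBERT.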

9 Conclusion

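SimCSE greatly improves state-of-the-art sentence embeddings on semantic textual similarity tasks. The paper proposes both an unsupervised approach, which predicts the input sentence itself with dropout noise applied, and a supervised approach that uses NLI datasets. It further justifies the inner workings by analyzing the alignment and uniformity of SimCSE alongside other baseline models. The authors believe the unsupervised approach in particular may find broad application in NLP. It offers a new perspective on data augmentation for text input, and it can be extended to other continuous representations and integrated into language model pre-training.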

