F2LLM-v2 results on Thai and Spanish benchmark #465

Merged
KennethEnevoldsen merged 14 commits into embeddings-benchmark:main from Geralt-Targaryen:main
Apr 23, 2026
Conversation

@Geralt-Targaryen
Contributor

Checklist

  • My model has a model sheet, report, or similar
  • My model has a reference implementation in mteb/models/model_implementations/; this can be an API. Instructions on how to add a model can be found here
    • No, but there is an existing PR ___
  • The results submitted are obtained using the reference implementation
  • My model is available, either as a publicly accessible API or publicly on e.g., Huggingface
  • I solemnly swear that for all results submitted I have not trained on the evaluation dataset including training splits. If I have, I have disclosed it clearly.

As discussed in embeddings-benchmark/mteb#3237 and embeddings-benchmark/mteb#4321, I evaluated the F2LLM-v2 models on the Thai and Spanish benchmarks.

MKQARetrieval, WisesightSentimentClassification.v2, and SpanishSentimentClassification.v2 are evaluated using these prompts:

{
    "MKQARetrieval": "Given a question, retrieve the answer.",
    "WisesightSentimentClassification.v2": "Classify the sentiment of the given text.",
    "SpanishSentimentClassification.v2": "Classify the sentiment of the given text."
}

Other tasks are evaluated using the prompts in the current implementation.
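The prompt selection described above can be sketched as a plain mapping with a fallback. This is only an illustration: the `TASK_PROMPTS` dict mirrors the one shown earlier, while `DEFAULT_PROMPT` and `prompt_for` are hypothetical names, not part of the actual F2LLM-v2 implementation in mteb.

```python
# Hypothetical sketch of task-specific prompt selection with a fallback.
# TASK_PROMPTS mirrors the dict above; DEFAULT_PROMPT is a placeholder,
# not the actual fallback used in the F2LLM-v2 implementation.
TASK_PROMPTS = {
    "MKQARetrieval": "Given a question, retrieve the answer.",
    "WisesightSentimentClassification.v2": "Classify the sentiment of the given text.",
    "SpanishSentimentClassification.v2": "Classify the sentiment of the given text.",
}

DEFAULT_PROMPT = "Represent the given text."  # made-up placeholder default


def prompt_for(task_name: str) -> str:
    """Return the prompt registered for a task, or the fallback default."""
    return TASK_PROMPTS.get(task_name, DEFAULT_PROMPT)
```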

I have two related questions:

  1. Do I need to create a PR on the model implementation to add the relevant prompts each time new tasks are evaluated, or does it suffice to list the prompts here?
  2. For certain tasks that already have results in other languages (e.g., MIRACLReranking), evaluating the models on the Thai benchmark erased the existing results, yet the later Spanish evaluation did not erase the Thai results. I presume this is because different MTEB versions were used (the existing results use v2.6.7, while I am now using v2.12.4). For now I resolved the issue by manually merging the new results into the existing file and setting the mteb_version field to the newer version, but it would be better if this conflict were resolved automatically.
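The manual workaround in question 2 could be scripted roughly as below. This is a minimal sketch under a simplified assumption: the file layout (a top-level "scores" dict keyed by task name, then by evaluation subset) is not the exact MTEB results schema, and the function name `merge_results` is made up for illustration.

```python
import json
from pathlib import Path


def merge_results(existing_path, new_path, out_path):
    """Merge per-task scores from a new results file into an existing one,
    keeping the newer mteb_version. The schema (a top-level "scores" dict
    keyed by task, then by subset) is a simplified assumption, not the
    exact MTEB results format."""
    existing = json.loads(Path(existing_path).read_text())
    new = json.loads(Path(new_path).read_text())

    merged = dict(existing)
    scores = dict(existing.get("scores", {}))
    for task, subsets in new.get("scores", {}).items():
        if isinstance(subsets, dict) and isinstance(scores.get(task), dict):
            # Keep subsets (e.g. other languages) that only exist in the
            # old file instead of overwriting the whole task entry.
            scores[task] = {**scores[task], **subsets}
        else:
            scores[task] = subsets
    merged["scores"] = scores
    merged["mteb_version"] = new.get("mteb_version", existing.get("mteb_version"))

    Path(out_path).write_text(json.dumps(merged, indent=2))
    return merged
```

With this, re-running a benchmark in a new language would extend a task's entry rather than replace it, and the mteb_version field would always reflect the most recent run.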

@Geralt-Targaryen Geralt-Targaryen changed the title from "F2LLM-v2 results and Thai and Spanish benchmark" to "F2LLM-v2 results on Thai and Spanish benchmark" on Mar 31, 2026
@github-actions

github-actions Bot commented Mar 31, 2026

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: codefuse-ai/F2LLM-v2-0.6B, codefuse-ai/F2LLM-v2-1.7B, codefuse-ai/F2LLM-v2-14B, codefuse-ai/F2LLM-v2-160M, codefuse-ai/F2LLM-v2-330M, codefuse-ai/F2LLM-v2-4B, codefuse-ai/F2LLM-v2-80M, codefuse-ai/F2LLM-v2-8B
Tasks: MIRACLReranking, MIRACLRetrievalHardNegatives.v2, MKQARetrieval, MLSUMClusteringP2P, MLSUMClusteringS2S, MTOPDomainClassification, MTOPIntentClassification, MintakaRetrieval, MrTidyRetrieval, MultiLongDocReranking, MultiLongDocRetrieval, SIB200Classification, STS22, STSBenchmarkMultilingualSTS, SpanishNewsClassification.v2, SpanishPassageRetrievalS2P, SpanishPassageRetrievalS2S, SpanishSentimentClassification.v2, WebFAQRetrieval, WisesightSentimentClassification.v2, XPQARetrieval, XQuADRetrieval

Results for codefuse-ai/F2LLM-v2-0.6B

| task_name | codefuse-ai/F2LLM-v2-0.6B | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| MIRACLReranking | 0.5971 | nan | 0.6249 | 0.6851 | cl-nagoya/ruri-v3-310m | True |
| MIRACLRetrievalHardNegatives.v2 | 0.5974 | nan | 0.5333 | 0.7836 | jinaai/jina-embeddings-v5-text-small | True |
| MKQARetrieval | 0.1730 | nan | nan | 0.1835 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| MLSUMClusteringP2P | 0.6919 | nan | 0.4494 | 0.5156 | Salesforce/SFR-Embedding-2_R | False |
| MLSUMClusteringS2S | 0.6856 | nan | 0.455 | 0.5103 | Salesforce/SFR-Embedding-2_R | False |
| MTOPDomainClassification | 0.9906 | 0.9777 | 0.9024 | 0.9995 | voyageai/voyage-3-m-exp | True |
| MTOPIntentClassification | 0.9370 | nan | 0.672 | 0.9307 | BAAI/bge-multilingual-gemma2 | True |
| MintakaRetrieval | 0.3229 | nan | 0.3037 | 0.6253 | BAAI/bge-multilingual-gemma2 | False |
| MrTidyRetrieval | 0.5959 | nan | 0.6509 | 0.7737 | intfloat/multilingual-e5-large-instruct | True |
| MultiLongDocReranking | 0.9112 | nan | 0.8887 | 0.9338 | cl-nagoya/ruri-v3-310m | False |
| MultiLongDocRetrieval | 0.3266 | nan | 0.3175 | 0.4626 | cl-nagoya/ruri-v3-30m | False |
| SIB200Classification | 0.9562 | nan | 0.7339 | 0.8467 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| STS22 | 0.6661 | nan | 0.6365 | 0.8314 | OrdalieTech/Solon-embeddings-mini-beta-1.1 | True |
| STSBenchmarkMultilingualSTS | 0.8434 | nan | 0.842 | 0.9596 | Gameselo/STS-multilingual-mpnet-base-v2 | False |
| SpanishNewsClassification.v2 | 0.8962 | nan | 0.8862 | 0.8862 | intfloat/multilingual-e5-large | False |
| SpanishPassageRetrievalS2P | 0.3583 | nan | 0.4196 | 0.4402 | BAAI/bge-m3 | False |
| SpanishPassageRetrievalS2S | 0.7418 | nan | 0.7232 | 0.7516 | intfloat/e5-mistral-7b-instruct | False |
| SpanishSentimentClassification.v2 | 0.9343 | nan | 0.9241 | 0.9628 | intfloat/multilingual-e5-large-instruct | False |
| WebFAQRetrieval | 0.8259 | nan | 0.7596 | 0.7953 | jinaai/jina-embeddings-v3 | False |
| WisesightSentimentClassification.v2 | 0.3655 | nan | nan | 0.4010 | google/embeddinggemma-300m | False |
| XPQARetrieval | 0.5184 | nan | 0.434 | 0.7742 | BAAI/bge-multilingual-gemma2 | False |
| XQuADRetrieval | 0.9319 | nan | 0.9687 | 0.9709 | telepix/PIXIE-Rune-v1.0 | False |
| Average | 0.6758 | 0.9777 | 0.6563 | 0.7283 | nan | - |

The model has high performance on these tasks: MTOPIntentClassification, SpanishNewsClassification.v2, SIB200Classification, WebFAQRetrieval, MLSUMClusteringP2P, MLSUMClusteringS2S

Training datasets: ANLI, AmazonCounterfactualClassification, AmazonCounterfactualVNClassification, AmazonPolarityClassification, AmazonPolarityClassification.v2, AmazonPolarityVNClassification, AmazonQA, AmazonReviewClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, ArguAna-Fa, ArguAna-Fa.v2, ArguAna-NL, ArguAna-NL.v2, ArguAna-PL, ArguAna-VN, ArxivClusteringP2P, ArxivClusteringP2P.v2, ArxivClusteringS2S, Aya, BQ, BactrianXLanguageClassification, BactrianXTranslation, Banking77Classification, Banking77Classification.v2, Banking77VNClassification, BioASQ, BiorxivClusteringP2P, BiorxivClusteringP2P.v2, BiorxivClusteringS2S, BiorxivClusteringS2S.v2, CEDR, CLIRMatrix, CMCQA, CMNLI, CNNDM, COIG, COLIEE, CORD19, CSL, CoLA, CodeFeedbackMT, CodeFeedbackST, CodeSearchNet, CodeSearchNetCCR, CosQA, DBPedia, DBPedia-Fa, DBPedia-NL, DBPedia-PL, DBPedia-PLHardNegatives, DBPedia-VN, DBPediaHardNegatives, DBPediaHardNegatives.v2, DuReader, ELI5, ESCI, EmotionClassification, EmotionClassification.v2, EmotionVNClassification, Europarl, FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, FEVERNL, FiQA-PL, FiQA2018, FiQA2018-Fa, FiQA2018-Fa.v2, FiQA2018-NL, FiQA2018-VN, GooAQ, HUMEArxivClusteringP2P, HUMEEmotionClassification, HUMERedditClusteringP2P, HUMESTS12, HUMESTS22, HUMESTSBenchmark, HUMEToxicConversationsClassification, HUMETweetSentimentExtractionClassification, HealthCareMagic, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, HotpotQANL, HuatuoEncQA, HuatuoKGQA, ImdbClassification, ImdbClassification.v2, ImdbVNClassification, InfinityInstruct, KoAlpaca, KoAlpacaRealQA, KoMagpie, LCQMC, LCSTS, LLMRetrievalData, Lawzhidao, M2Lingual, MEDI2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MKQA, MLDR, MLSUMClustering, 
MLSUMRetrieval, MMARCO, MNLI, MQA, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MSciNLI, MTOPDomainClassification, MTOPDomainVNClassification, MTOPIntentClassification, MTOPIntentVNClassification, MURI, MailruQA, MassiveIntentClassification, MassiveIntentVNClassification, MassiveScenarioClassification, MassiveScenarioVNClassification, MedInstruct, MedMCQA, MedQA, MedQuAD, MedicalFlashcards, MedicalInstruction, MedicalQARu, MedrxivClusteringP2P, MedrxivClusteringP2P.v2, MedrxivClusteringS2S, MedrxivClusteringS2S.v2, MrTidyRetrieval, MrTyDiJaRetrievalLite, MultiAlpaca, MultiCPRECom, MultiCPRMedical, NFCorpus, NFCorpus-Fa, NFCorpus-NL, NFCorpus-NL.v2, NFCorpus-PL, NFCorpus-VN, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoArguAnaRetrieval, NanoDBPedia-VN, NanoDBPediaRetrieval, NanoFEVER-VN, NanoFEVERRetrieval, NanoFiQA2018Retrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNFCorpusRetrieval, NanoNQ-VN, NanoNQRetrieval, NanoSciFactRetrieval, NaturalReasoning, NordicClassification, NordicRetrieval, NordicSTS, NordicTextMatching, OASST2, OCNLI, OpenCodeGeneticInstruct, OpenCodeReasoning2, OpenOrca, PAQ, PQuAD, ParSQuAD, ParaCrawl, PawsX, PersianQA, ProCQA, PubMedQA, QBQTC, QQP, RedditClustering, RedditClustering-VN, RedditClustering.v2, RedditClusteringP2P, RedditClusteringP2P-VN, RedditClusteringP2P.v2, RefGPT, RuInstruct, RuSentimentClustering, S2ORC, SIB200, SNLI, SPECTER, SQuAD, STS12, STS22, STS22.v2, STSBenchmark, STSBenchmark-VN, SciFact, SciFact-Fa, SciFact-Fa.v2, SciFact-NL, SciFact-NL.v2, SciFact-PL, SciFact-VN, SentenceCompression, SiberianDataset, SimCLUE, StackExchange, StackExchangeClustering, StackExchangeClustering-VN, StackExchangeClustering.v2, StackExchangeClusteringP2P, StackExchangeClusteringP2P-VN, StackExchangeClusteringP2P.v2, StackExchangeDupQuestions, StackOverflowDupQuestions, 
StackOverflowDupQuestions-VN, StackOverflowQA, SyntheticText2SQL, T2Ranking, THUCNews, TNews, TNews.v2, ToxicConversationsClassification, ToxicConversationsClassification.v2, ToxicConversationsVNClassification, TriviaQA, TweetSentimentExtractionClassification, TweetSentimentExtractionClassification.v2, TweetSentimentExtractionVNClassification, TwentyNewsgroupsClustering, TwentyNewsgroupsClustering-VN, TwentyNewsgroupsClustering.v2, UNPC, Waimai, Waimai.v2, WebFAQ, WikiOmnia, WildChat, XCodeEvalCodeToCode, XCodeEvalNLToCode, XCodeEvalTranslation, XNLI, XSum, YahooAnswers, cMedQAv2, webMedQA


Results for codefuse-ai/F2LLM-v2-1.7B

| task_name | codefuse-ai/F2LLM-v2-1.7B | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| MIRACLReranking | 0.6168 | nan | 0.6249 | 0.6851 | cl-nagoya/ruri-v3-310m | True |
| MIRACLRetrievalHardNegatives.v2 | 0.6263 | nan | 0.5333 | 0.7836 | jinaai/jina-embeddings-v5-text-small | True |
| MKQARetrieval | 0.2651 | nan | nan | 0.1835 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| MLSUMClusteringP2P | 0.7047 | nan | 0.4494 | 0.5156 | Salesforce/SFR-Embedding-2_R | False |
| MLSUMClusteringS2S | 0.6979 | nan | 0.455 | 0.5103 | Salesforce/SFR-Embedding-2_R | False |
| MTOPDomainClassification | 0.9914 | 0.9777 | 0.9024 | 0.9995 | voyageai/voyage-3-m-exp | True |
| MTOPIntentClassification | 0.9448 | nan | 0.672 | 0.9307 | BAAI/bge-multilingual-gemma2 | True |
| MintakaRetrieval | 0.4060 | nan | 0.3037 | 0.6253 | BAAI/bge-multilingual-gemma2 | False |
| MrTidyRetrieval | 0.6393 | nan | 0.6509 | 0.7737 | intfloat/multilingual-e5-large-instruct | True |
| MultiLongDocReranking | 0.9274 | nan | 0.8887 | 0.9338 | cl-nagoya/ruri-v3-310m | False |
| MultiLongDocRetrieval | 0.3820 | nan | 0.3175 | 0.4626 | cl-nagoya/ruri-v3-30m | False |
| SIB200Classification | 0.9666 | nan | 0.7339 | 0.8467 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| STS22 | 0.6665 | nan | 0.6365 | 0.8314 | OrdalieTech/Solon-embeddings-mini-beta-1.1 | True |
| STSBenchmarkMultilingualSTS | 0.8570 | nan | 0.842 | 0.9596 | Gameselo/STS-multilingual-mpnet-base-v2 | False |
| SpanishNewsClassification.v2 | 0.9152 | nan | 0.8862 | 0.8862 | intfloat/multilingual-e5-large | False |
| SpanishPassageRetrievalS2P | 0.4034 | nan | 0.4196 | 0.4402 | BAAI/bge-m3 | False |
| SpanishPassageRetrievalS2S | 0.7711 | nan | 0.7232 | 0.7516 | intfloat/e5-mistral-7b-instruct | False |
| SpanishSentimentClassification.v2 | 0.9445 | nan | 0.9241 | 0.9628 | intfloat/multilingual-e5-large-instruct | False |
| WebFAQRetrieval | 0.8439 | nan | 0.7596 | 0.7953 | jinaai/jina-embeddings-v3 | False |
| WisesightSentimentClassification.v2 | 0.3840 | nan | nan | 0.4010 | google/embeddinggemma-300m | False |
| XPQARetrieval | 0.5591 | nan | 0.434 | 0.7742 | BAAI/bge-multilingual-gemma2 | False |
| XQuADRetrieval | 0.9474 | nan | 0.9687 | 0.9709 | telepix/PIXIE-Rune-v1.0 | False |
| Average | 0.7028 | 0.9777 | 0.6563 | 0.7283 | nan | - |

The model has high performance on these tasks: MTOPIntentClassification, SpanishNewsClassification.v2, SIB200Classification, WebFAQRetrieval, SpanishPassageRetrievalS2S, MLSUMClusteringP2P, MLSUMClusteringS2S, MKQARetrieval

Training datasets: ANLI, AmazonCounterfactualClassification, AmazonCounterfactualVNClassification, AmazonPolarityClassification, AmazonPolarityClassification.v2, AmazonPolarityVNClassification, AmazonQA, AmazonReviewClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, ArguAna-Fa, ArguAna-Fa.v2, ArguAna-NL, ArguAna-NL.v2, ArguAna-PL, ArguAna-VN, ArxivClusteringP2P, ArxivClusteringP2P.v2, ArxivClusteringS2S, Aya, BQ, BactrianXLanguageClassification, BactrianXTranslation, Banking77Classification, Banking77Classification.v2, Banking77VNClassification, BioASQ, BiorxivClusteringP2P, BiorxivClusteringP2P.v2, BiorxivClusteringS2S, BiorxivClusteringS2S.v2, CEDR, CLIRMatrix, CMCQA, CMNLI, CNNDM, COIG, COLIEE, CORD19, CSL, CoLA, CodeFeedbackMT, CodeFeedbackST, CodeSearchNet, CodeSearchNetCCR, CosQA, DBPedia, DBPedia-Fa, DBPedia-NL, DBPedia-PL, DBPedia-PLHardNegatives, DBPedia-VN, DBPediaHardNegatives, DBPediaHardNegatives.v2, DuReader, ELI5, ESCI, EmotionClassification, EmotionClassification.v2, EmotionVNClassification, Europarl, FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, FEVERNL, FiQA-PL, FiQA2018, FiQA2018-Fa, FiQA2018-Fa.v2, FiQA2018-NL, FiQA2018-VN, GooAQ, HUMEArxivClusteringP2P, HUMEEmotionClassification, HUMERedditClusteringP2P, HUMESTS12, HUMESTS22, HUMESTSBenchmark, HUMEToxicConversationsClassification, HUMETweetSentimentExtractionClassification, HealthCareMagic, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, HotpotQANL, HuatuoEncQA, HuatuoKGQA, ImdbClassification, ImdbClassification.v2, ImdbVNClassification, InfinityInstruct, KoAlpaca, KoAlpacaRealQA, KoMagpie, LCQMC, LCSTS, LLMRetrievalData, Lawzhidao, M2Lingual, MEDI2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MKQA, MLDR, MLSUMClustering, 
MLSUMRetrieval, MMARCO, MNLI, MQA, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MSciNLI, MTOPDomainClassification, MTOPDomainVNClassification, MTOPIntentClassification, MTOPIntentVNClassification, MURI, MailruQA, MassiveIntentClassification, MassiveIntentVNClassification, MassiveScenarioClassification, MassiveScenarioVNClassification, MedInstruct, MedMCQA, MedQA, MedQuAD, MedicalFlashcards, MedicalInstruction, MedicalQARu, MedrxivClusteringP2P, MedrxivClusteringP2P.v2, MedrxivClusteringS2S, MedrxivClusteringS2S.v2, MrTidyRetrieval, MrTyDiJaRetrievalLite, MultiAlpaca, MultiCPRECom, MultiCPRMedical, NFCorpus, NFCorpus-Fa, NFCorpus-NL, NFCorpus-NL.v2, NFCorpus-PL, NFCorpus-VN, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoArguAnaRetrieval, NanoDBPedia-VN, NanoDBPediaRetrieval, NanoFEVER-VN, NanoFEVERRetrieval, NanoFiQA2018Retrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNFCorpusRetrieval, NanoNQ-VN, NanoNQRetrieval, NanoSciFactRetrieval, NaturalReasoning, NordicClassification, NordicRetrieval, NordicSTS, NordicTextMatching, OASST2, OCNLI, OpenCodeGeneticInstruct, OpenCodeReasoning2, OpenOrca, PAQ, PQuAD, ParSQuAD, ParaCrawl, PawsX, PersianQA, ProCQA, PubMedQA, QBQTC, QQP, RedditClustering, RedditClustering-VN, RedditClustering.v2, RedditClusteringP2P, RedditClusteringP2P-VN, RedditClusteringP2P.v2, RefGPT, RuInstruct, RuSentimentClustering, S2ORC, SIB200, SNLI, SPECTER, SQuAD, STS12, STS22, STS22.v2, STSBenchmark, STSBenchmark-VN, SciFact, SciFact-Fa, SciFact-Fa.v2, SciFact-NL, SciFact-NL.v2, SciFact-PL, SciFact-VN, SentenceCompression, SiberianDataset, SimCLUE, StackExchange, StackExchangeClustering, StackExchangeClustering-VN, StackExchangeClustering.v2, StackExchangeClusteringP2P, StackExchangeClusteringP2P-VN, StackExchangeClusteringP2P.v2, StackExchangeDupQuestions, StackOverflowDupQuestions, 
StackOverflowDupQuestions-VN, StackOverflowQA, SyntheticText2SQL, T2Ranking, THUCNews, TNews, TNews.v2, ToxicConversationsClassification, ToxicConversationsClassification.v2, ToxicConversationsVNClassification, TriviaQA, TweetSentimentExtractionClassification, TweetSentimentExtractionClassification.v2, TweetSentimentExtractionVNClassification, TwentyNewsgroupsClustering, TwentyNewsgroupsClustering-VN, TwentyNewsgroupsClustering.v2, UNPC, Waimai, Waimai.v2, WebFAQ, WikiOmnia, WildChat, XCodeEvalCodeToCode, XCodeEvalNLToCode, XCodeEvalTranslation, XNLI, XSum, YahooAnswers, cMedQAv2, webMedQA


Results for codefuse-ai/F2LLM-v2-14B

| task_name | codefuse-ai/F2LLM-v2-14B | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| MIRACLReranking | 0.6248 | nan | 0.6249 | 0.6851 | cl-nagoya/ruri-v3-310m | True |
| MIRACLRetrievalHardNegatives.v2 | 0.6479 | nan | 0.5333 | 0.7836 | jinaai/jina-embeddings-v5-text-small | True |
| MKQARetrieval | 0.4634 | nan | nan | 0.1835 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| MLSUMClusteringP2P | 0.7183 | nan | 0.4494 | 0.5156 | Salesforce/SFR-Embedding-2_R | False |
| MLSUMClusteringS2S | 0.7147 | nan | 0.455 | 0.5103 | Salesforce/SFR-Embedding-2_R | False |
| MTOPDomainClassification | 0.9924 | 0.9777 | 0.9024 | 0.9995 | voyageai/voyage-3-m-exp | True |
| MTOPIntentClassification | 0.9561 | nan | 0.672 | 0.9307 | BAAI/bge-multilingual-gemma2 | True |
| MintakaRetrieval | 0.6244 | nan | 0.3037 | 0.6253 | BAAI/bge-multilingual-gemma2 | False |
| MrTidyRetrieval | 0.7073 | nan | 0.6509 | 0.7737 | intfloat/multilingual-e5-large-instruct | True |
| MultiLongDocReranking | 0.9316 | nan | 0.8887 | 0.9338 | cl-nagoya/ruri-v3-310m | False |
| MultiLongDocRetrieval | 0.4014 | nan | 0.3175 | 0.4626 | cl-nagoya/ruri-v3-30m | False |
| SIB200Classification | 0.9644 | nan | 0.7339 | 0.8467 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| STS22 | 0.6394 | nan | 0.6365 | 0.8314 | OrdalieTech/Solon-embeddings-mini-beta-1.1 | True |
| STSBenchmarkMultilingualSTS | 0.8643 | nan | 0.842 | 0.9596 | Gameselo/STS-multilingual-mpnet-base-v2 | False |
| SpanishNewsClassification.v2 | 0.9290 | nan | 0.8862 | 0.8862 | intfloat/multilingual-e5-large | False |
| SpanishPassageRetrievalS2P | 0.4189 | nan | 0.4196 | 0.4402 | BAAI/bge-m3 | False |
| SpanishPassageRetrievalS2S | 0.7844 | nan | 0.7232 | 0.7516 | intfloat/e5-mistral-7b-instruct | False |
| SpanishSentimentClassification.v2 | 0.9693 | nan | 0.9241 | 0.9628 | intfloat/multilingual-e5-large-instruct | False |
| WebFAQRetrieval | 0.8704 | nan | 0.7596 | 0.7953 | jinaai/jina-embeddings-v3 | False |
| WisesightSentimentClassification.v2 | 0.4169 | nan | nan | 0.4010 | google/embeddinggemma-300m | False |
| XPQARetrieval | 0.6049 | nan | 0.434 | 0.7742 | BAAI/bge-multilingual-gemma2 | False |
| XQuADRetrieval | 0.9656 | nan | 0.9687 | 0.9709 | telepix/PIXIE-Rune-v1.0 | False |
| Average | 0.7368 | 0.9777 | 0.6563 | 0.7283 | nan | - |

The model has high performance on these tasks: SpanishSentimentClassification.v2, MTOPIntentClassification, SpanishNewsClassification.v2, SIB200Classification, WebFAQRetrieval, SpanishPassageRetrievalS2S, MLSUMClusteringP2P, MLSUMClusteringS2S, WisesightSentimentClassification.v2, MKQARetrieval

Training datasets: ANLI, AmazonCounterfactualClassification, AmazonCounterfactualVNClassification, AmazonPolarityClassification, AmazonPolarityClassification.v2, AmazonPolarityVNClassification, AmazonQA, AmazonReviewClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, ArguAna-Fa, ArguAna-Fa.v2, ArguAna-NL, ArguAna-NL.v2, ArguAna-PL, ArguAna-VN, ArxivClusteringP2P, ArxivClusteringP2P.v2, ArxivClusteringS2S, Aya, BQ, BactrianXLanguageClassification, BactrianXTranslation, Banking77Classification, Banking77Classification.v2, Banking77VNClassification, BioASQ, BiorxivClusteringP2P, BiorxivClusteringP2P.v2, BiorxivClusteringS2S, BiorxivClusteringS2S.v2, CEDR, CLIRMatrix, CMCQA, CMNLI, CNNDM, COIG, COLIEE, CORD19, CSL, CoLA, CodeFeedbackMT, CodeFeedbackST, CodeSearchNet, CodeSearchNetCCR, CosQA, DBPedia, DBPedia-Fa, DBPedia-NL, DBPedia-PL, DBPedia-PLHardNegatives, DBPedia-VN, DBPediaHardNegatives, DBPediaHardNegatives.v2, DuReader, ELI5, ESCI, EmotionClassification, EmotionClassification.v2, EmotionVNClassification, Europarl, FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, FEVERNL, FiQA-PL, FiQA2018, FiQA2018-Fa, FiQA2018-Fa.v2, FiQA2018-NL, FiQA2018-VN, GooAQ, HUMEArxivClusteringP2P, HUMEEmotionClassification, HUMERedditClusteringP2P, HUMESTS12, HUMESTS22, HUMESTSBenchmark, HUMEToxicConversationsClassification, HUMETweetSentimentExtractionClassification, HealthCareMagic, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, HotpotQANL, HuatuoEncQA, HuatuoKGQA, ImdbClassification, ImdbClassification.v2, ImdbVNClassification, InfinityInstruct, KoAlpaca, KoAlpacaRealQA, KoMagpie, LCQMC, LCSTS, LLMRetrievalData, Lawzhidao, M2Lingual, MEDI2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MKQA, MLDR, MLSUMClustering, 
MLSUMRetrieval, MMARCO, MNLI, MQA, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MSciNLI, MTOPDomainClassification, MTOPDomainVNClassification, MTOPIntentClassification, MTOPIntentVNClassification, MURI, MailruQA, MassiveIntentClassification, MassiveIntentVNClassification, MassiveScenarioClassification, MassiveScenarioVNClassification, MedInstruct, MedMCQA, MedQA, MedQuAD, MedicalFlashcards, MedicalInstruction, MedicalQARu, MedrxivClusteringP2P, MedrxivClusteringP2P.v2, MedrxivClusteringS2S, MedrxivClusteringS2S.v2, MrTidyRetrieval, MrTyDiJaRetrievalLite, MultiAlpaca, MultiCPRECom, MultiCPRMedical, NFCorpus, NFCorpus-Fa, NFCorpus-NL, NFCorpus-NL.v2, NFCorpus-PL, NFCorpus-VN, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoArguAnaRetrieval, NanoDBPedia-VN, NanoDBPediaRetrieval, NanoFEVER-VN, NanoFEVERRetrieval, NanoFiQA2018Retrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNFCorpusRetrieval, NanoNQ-VN, NanoNQRetrieval, NanoSciFactRetrieval, NaturalReasoning, NordicClassification, NordicRetrieval, NordicSTS, NordicTextMatching, OASST2, OCNLI, OpenCodeGeneticInstruct, OpenCodeReasoning2, OpenOrca, PAQ, PQuAD, ParSQuAD, ParaCrawl, PawsX, PersianQA, ProCQA, PubMedQA, QBQTC, QQP, RedditClustering, RedditClustering-VN, RedditClustering.v2, RedditClusteringP2P, RedditClusteringP2P-VN, RedditClusteringP2P.v2, RefGPT, RuInstruct, RuSentimentClustering, S2ORC, SIB200, SNLI, SPECTER, SQuAD, STS12, STS22, STS22.v2, STSBenchmark, STSBenchmark-VN, SciFact, SciFact-Fa, SciFact-Fa.v2, SciFact-NL, SciFact-NL.v2, SciFact-PL, SciFact-VN, SentenceCompression, SiberianDataset, SimCLUE, StackExchange, StackExchangeClustering, StackExchangeClustering-VN, StackExchangeClustering.v2, StackExchangeClusteringP2P, StackExchangeClusteringP2P-VN, StackExchangeClusteringP2P.v2, StackExchangeDupQuestions, StackOverflowDupQuestions, 
StackOverflowDupQuestions-VN, StackOverflowQA, SyntheticText2SQL, T2Ranking, THUCNews, TNews, TNews.v2, ToxicConversationsClassification, ToxicConversationsClassification.v2, ToxicConversationsVNClassification, TriviaQA, TweetSentimentExtractionClassification, TweetSentimentExtractionClassification.v2, TweetSentimentExtractionVNClassification, TwentyNewsgroupsClustering, TwentyNewsgroupsClustering-VN, TwentyNewsgroupsClustering.v2, UNPC, Waimai, Waimai.v2, WebFAQ, WikiOmnia, WildChat, XCodeEvalCodeToCode, XCodeEvalNLToCode, XCodeEvalTranslation, XNLI, XSum, YahooAnswers, cMedQAv2, webMedQA


Results for codefuse-ai/F2LLM-v2-160M

| task_name | codefuse-ai/F2LLM-v2-160M | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| MIRACLReranking | 0.5403 | nan | 0.6249 | 0.6851 | cl-nagoya/ruri-v3-310m | True |
| MIRACLRetrievalHardNegatives.v2 | 0.5335 | nan | 0.5333 | 0.7836 | jinaai/jina-embeddings-v5-text-small | True |
| MKQARetrieval | 0.0801 | nan | nan | 0.1835 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| MLSUMClusteringP2P | 0.6487 | nan | 0.4494 | 0.5156 | Salesforce/SFR-Embedding-2_R | False |
| MLSUMClusteringS2S | 0.6397 | nan | 0.4550 | 0.5103 | Salesforce/SFR-Embedding-2_R | False |
| MTOPDomainClassification | 0.9852 | 0.9777 | 0.9024 | 0.9995 | voyageai/voyage-3-m-exp | True |
| MTOPIntentClassification | 0.9048 | nan | 0.6720 | 0.9307 | BAAI/bge-multilingual-gemma2 | True |
| MintakaRetrieval | 0.2242 | nan | 0.3037 | 0.6253 | BAAI/bge-multilingual-gemma2 | False |
| MrTidyRetrieval | 0.4970 | nan | 0.6509 | 0.7737 | intfloat/multilingual-e5-large-instruct | True |
| MultiLongDocReranking | 0.8935 | nan | 0.8887 | 0.9338 | cl-nagoya/ruri-v3-310m | False |
| MultiLongDocRetrieval | 0.2085 | nan | 0.3175 | 0.4626 | cl-nagoya/ruri-v3-30m | False |
| SIB200Classification | 0.8935 | nan | 0.7339 | 0.8467 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| STS22 | 0.6308 | nan | 0.6365 | 0.8314 | OrdalieTech/Solon-embeddings-mini-beta-1.1 | True |
| STSBenchmarkMultilingualSTS | 0.7914 | nan | 0.8420 | 0.9596 | Gameselo/STS-multilingual-mpnet-base-v2 | False |
| SpanishNewsClassification.v2 | 0.8741 | nan | 0.8862 | 0.8862 | intfloat/multilingual-e5-large | False |
| SpanishPassageRetrievalS2P | 0.3252 | nan | 0.4196 | 0.4402 | BAAI/bge-m3 | False |
| SpanishPassageRetrievalS2S | 0.6471 | nan | 0.7232 | 0.7516 | intfloat/e5-mistral-7b-instruct | False |
| SpanishSentimentClassification.v2 | 0.8635 | nan | 0.9241 | 0.9628 | intfloat/multilingual-e5-large-instruct | False |
| WebFAQRetrieval | 0.7480 | nan | 0.7596 | 0.7953 | jinaai/jina-embeddings-v3 | False |
| WisesightSentimentClassification.v2 | 0.3366 | nan | nan | 0.4010 | google/embeddinggemma-300m | False |
| XPQARetrieval | 0.4359 | nan | 0.4340 | 0.7742 | BAAI/bge-multilingual-gemma2 | False |
| XQuADRetrieval | 0.8962 | nan | 0.9687 | 0.9709 | telepix/PIXIE-Rune-v1.0 | False |
| Average | 0.6181 | 0.9777 | 0.6563 | 0.7283 | nan | - |

The model has high performance on these tasks: SIB200Classification, MLSUMClusteringP2P, MLSUMClusteringS2S

Training datasets: ANLI, AmazonCounterfactualClassification, AmazonCounterfactualVNClassification, AmazonPolarityClassification, AmazonPolarityClassification.v2, AmazonPolarityVNClassification, AmazonQA, AmazonReviewClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, ArguAna-Fa, ArguAna-Fa.v2, ArguAna-NL, ArguAna-NL.v2, ArguAna-PL, ArguAna-VN, ArxivClusteringP2P, ArxivClusteringP2P.v2, ArxivClusteringS2S, Aya, BQ, BactrianXLanguageClassification, BactrianXTranslation, Banking77Classification, Banking77Classification.v2, Banking77VNClassification, BioASQ, BiorxivClusteringP2P, BiorxivClusteringP2P.v2, BiorxivClusteringS2S, BiorxivClusteringS2S.v2, CEDR, CLIRMatrix, CMCQA, CMNLI, CNNDM, COIG, COLIEE, CORD19, CSL, CoLA, CodeFeedbackMT, CodeFeedbackST, CodeSearchNet, CodeSearchNetCCR, CosQA, DBPedia, DBPedia-Fa, DBPedia-NL, DBPedia-PL, DBPedia-PLHardNegatives, DBPedia-VN, DBPediaHardNegatives, DBPediaHardNegatives.v2, DuReader, ELI5, ESCI, EmotionClassification, EmotionClassification.v2, EmotionVNClassification, Europarl, FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, FEVERNL, FiQA-PL, FiQA2018, FiQA2018-Fa, FiQA2018-Fa.v2, FiQA2018-NL, FiQA2018-VN, GooAQ, HUMEArxivClusteringP2P, HUMEEmotionClassification, HUMERedditClusteringP2P, HUMESTS12, HUMESTS22, HUMESTSBenchmark, HUMEToxicConversationsClassification, HUMETweetSentimentExtractionClassification, HealthCareMagic, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, HotpotQANL, HuatuoEncQA, HuatuoKGQA, ImdbClassification, ImdbClassification.v2, ImdbVNClassification, InfinityInstruct, KoAlpaca, KoAlpacaRealQA, KoMagpie, LCQMC, LCSTS, LLMRetrievalData, Lawzhidao, M2Lingual, MEDI2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MKQA, MLDR, MLSUMClustering, 
MLSUMRetrieval, MMARCO, MNLI, MQA, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MSciNLI, MTOPDomainClassification, MTOPDomainVNClassification, MTOPIntentClassification, MTOPIntentVNClassification, MURI, MailruQA, MassiveIntentClassification, MassiveIntentVNClassification, MassiveScenarioClassification, MassiveScenarioVNClassification, MedInstruct, MedMCQA, MedQA, MedQuAD, MedicalFlashcards, MedicalInstruction, MedicalQARu, MedrxivClusteringP2P, MedrxivClusteringP2P.v2, MedrxivClusteringS2S, MedrxivClusteringS2S.v2, MrTidyRetrieval, MrTyDiJaRetrievalLite, MultiAlpaca, MultiCPRECom, MultiCPRMedical, NFCorpus, NFCorpus-Fa, NFCorpus-NL, NFCorpus-NL.v2, NFCorpus-PL, NFCorpus-VN, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoArguAnaRetrieval, NanoDBPedia-VN, NanoDBPediaRetrieval, NanoFEVER-VN, NanoFEVERRetrieval, NanoFiQA2018Retrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNFCorpusRetrieval, NanoNQ-VN, NanoNQRetrieval, NanoSciFactRetrieval, NaturalReasoning, NordicClassification, NordicRetrieval, NordicSTS, NordicTextMatching, OASST2, OCNLI, OpenCodeGeneticInstruct, OpenCodeReasoning2, OpenOrca, PAQ, PQuAD, ParSQuAD, ParaCrawl, PawsX, PersianQA, ProCQA, PubMedQA, QBQTC, QQP, RedditClustering, RedditClustering-VN, RedditClustering.v2, RedditClusteringP2P, RedditClusteringP2P-VN, RedditClusteringP2P.v2, RefGPT, RuInstruct, RuSentimentClustering, S2ORC, SIB200, SNLI, SPECTER, SQuAD, STS12, STS22, STS22.v2, STSBenchmark, STSBenchmark-VN, SciFact, SciFact-Fa, SciFact-Fa.v2, SciFact-NL, SciFact-NL.v2, SciFact-PL, SciFact-VN, SentenceCompression, SiberianDataset, SimCLUE, StackExchange, StackExchangeClustering, StackExchangeClustering-VN, StackExchangeClustering.v2, StackExchangeClusteringP2P, StackExchangeClusteringP2P-VN, StackExchangeClusteringP2P.v2, StackExchangeDupQuestions, StackOverflowDupQuestions, 
StackOverflowDupQuestions-VN, StackOverflowQA, SyntheticText2SQL, T2Ranking, THUCNews, TNews, TNews.v2, ToxicConversationsClassification, ToxicConversationsClassification.v2, ToxicConversationsVNClassification, TriviaQA, TweetSentimentExtractionClassification, TweetSentimentExtractionClassification.v2, TweetSentimentExtractionVNClassification, TwentyNewsgroupsClustering, TwentyNewsgroupsClustering-VN, TwentyNewsgroupsClustering.v2, UNPC, Waimai, Waimai.v2, WebFAQ, WikiOmnia, WildChat, XCodeEvalCodeToCode, XCodeEvalNLToCode, XCodeEvalTranslation, XNLI, XSum, YahooAnswers, cMedQAv2, webMedQA


Results for codefuse-ai/F2LLM-v2-330M

| task_name | codefuse-ai/F2LLM-v2-330M | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| MIRACLReranking | 0.5824 | nan | 0.6249 | 0.6851 | cl-nagoya/ruri-v3-310m | True |
| MIRACLRetrievalHardNegatives.v2 | 0.5890 | nan | 0.5333 | 0.7836 | jinaai/jina-embeddings-v5-text-small | True |
| MKQARetrieval | 0.1404 | nan | nan | 0.1835 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| MLSUMClusteringP2P | 0.6777 | nan | 0.4494 | 0.5156 | Salesforce/SFR-Embedding-2_R | False |
| MLSUMClusteringS2S | 0.6736 | nan | 0.455 | 0.5103 | Salesforce/SFR-Embedding-2_R | False |
| MTOPDomainClassification | 0.9893 | 0.9777 | 0.9024 | 0.9995 | voyageai/voyage-3-m-exp | True |
| MTOPIntentClassification | 0.9293 | nan | 0.672 | 0.9307 | BAAI/bge-multilingual-gemma2 | True |
| MintakaRetrieval | 0.2795 | nan | 0.3037 | 0.6253 | BAAI/bge-multilingual-gemma2 | False |
| MrTidyRetrieval | 0.5738 | nan | 0.6509 | 0.7737 | intfloat/multilingual-e5-large-instruct | True |
| MultiLongDocReranking | 0.9153 | nan | 0.8887 | 0.9338 | cl-nagoya/ruri-v3-310m | False |
| MultiLongDocRetrieval | 0.3155 | nan | 0.3175 | 0.4626 | cl-nagoya/ruri-v3-30m | False |
| SIB200Classification | 0.9497 | nan | 0.7339 | 0.8467 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| STS22 | 0.6634 | nan | 0.6365 | 0.8314 | OrdalieTech/Solon-embeddings-mini-beta-1.1 | True |
| STSBenchmarkMultilingualSTS | 0.8322 | nan | 0.842 | 0.9596 | Gameselo/STS-multilingual-mpnet-base-v2 | False |
| SpanishNewsClassification.v2 | 0.9013 | nan | 0.8862 | 0.8862 | intfloat/multilingual-e5-large | False |
| SpanishPassageRetrievalS2P | 0.3719 | nan | 0.4196 | 0.4402 | BAAI/bge-m3 | False |
| SpanishPassageRetrievalS2S | 0.7415 | nan | 0.7232 | 0.7516 | intfloat/e5-mistral-7b-instruct | False |
| SpanishSentimentClassification.v2 | 0.8978 | nan | 0.9241 | 0.9628 | intfloat/multilingual-e5-large-instruct | False |
| WebFAQRetrieval | 0.8058 | nan | 0.7596 | 0.7953 | jinaai/jina-embeddings-v3 | False |
| WisesightSentimentClassification.v2 | 0.3294 | nan | nan | 0.4010 | google/embeddinggemma-300m | False |
| XPQARetrieval | 0.4852 | nan | 0.434 | 0.7742 | BAAI/bge-multilingual-gemma2 | False |
| XQuADRetrieval | 0.9243 | nan | 0.9687 | 0.9709 | telepix/PIXIE-Rune-v1.0 | False |
| Average | 0.6622 | 0.9777 | 0.6563 | 0.7283 | nan | - |
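The Average row above is, presumably, the mean over the tasks for which a score is available, skipping `nan` entries. A minimal sketch of that aggregation (scores copied from the 330M column of the table; the skip-`nan` behaviour is an assumption about how the report is generated, not a confirmed detail of the workflow):

```python
import math

# Per-task scores for codefuse-ai/F2LLM-v2-330M, copied from the table above.
scores = [
    0.5824, 0.5890, 0.1404, 0.6777, 0.6736, 0.9893, 0.9293, 0.2795,
    0.5738, 0.9153, 0.3155, 0.9497, 0.6634, 0.8322, 0.9013, 0.3719,
    0.7415, 0.8978, 0.8058, 0.3294, 0.4852, 0.9243,
]

def column_average(values):
    """Mean over the available scores, ignoring nan entries."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present)

print(round(column_average(scores), 4))  # 0.6622, matching the Average row
```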

The model has high performance on these tasks: SpanishNewsClassification.v2, SIB200Classification, WebFAQRetrieval, MLSUMClusteringP2P, MLSUMClusteringS2S

Training datasets: ANLI, AmazonCounterfactualClassification, AmazonCounterfactualVNClassification, AmazonPolarityClassification, AmazonPolarityClassification.v2, AmazonPolarityVNClassification, AmazonQA, AmazonReviewClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, ArguAna-Fa, ArguAna-Fa.v2, ArguAna-NL, ArguAna-NL.v2, ArguAna-PL, ArguAna-VN, ArxivClusteringP2P, ArxivClusteringP2P.v2, ArxivClusteringS2S, Aya, BQ, BactrianXLanguageClassification, BactrianXTranslation, Banking77Classification, Banking77Classification.v2, Banking77VNClassification, BioASQ, BiorxivClusteringP2P, BiorxivClusteringP2P.v2, BiorxivClusteringS2S, BiorxivClusteringS2S.v2, CEDR, CLIRMatrix, CMCQA, CMNLI, CNNDM, COIG, COLIEE, CORD19, CSL, CoLA, CodeFeedbackMT, CodeFeedbackST, CodeSearchNet, CodeSearchNetCCR, CosQA, DBPedia, DBPedia-Fa, DBPedia-NL, DBPedia-PL, DBPedia-PLHardNegatives, DBPedia-VN, DBPediaHardNegatives, DBPediaHardNegatives.v2, DuReader, ELI5, ESCI, EmotionClassification, EmotionClassification.v2, EmotionVNClassification, Europarl, FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, FEVERNL, FiQA-PL, FiQA2018, FiQA2018-Fa, FiQA2018-Fa.v2, FiQA2018-NL, FiQA2018-VN, GooAQ, HUMEArxivClusteringP2P, HUMEEmotionClassification, HUMERedditClusteringP2P, HUMESTS12, HUMESTS22, HUMESTSBenchmark, HUMEToxicConversationsClassification, HUMETweetSentimentExtractionClassification, HealthCareMagic, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, HotpotQANL, HuatuoEncQA, HuatuoKGQA, ImdbClassification, ImdbClassification.v2, ImdbVNClassification, InfinityInstruct, KoAlpaca, KoAlpacaRealQA, KoMagpie, LCQMC, LCSTS, LLMRetrievalData, Lawzhidao, M2Lingual, MEDI2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MKQA, MLDR, MLSUMClustering, 
MLSUMRetrieval, MMARCO, MNLI, MQA, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MSciNLI, MTOPDomainClassification, MTOPDomainVNClassification, MTOPIntentClassification, MTOPIntentVNClassification, MURI, MailruQA, MassiveIntentClassification, MassiveIntentVNClassification, MassiveScenarioClassification, MassiveScenarioVNClassification, MedInstruct, MedMCQA, MedQA, MedQuAD, MedicalFlashcards, MedicalInstruction, MedicalQARu, MedrxivClusteringP2P, MedrxivClusteringP2P.v2, MedrxivClusteringS2S, MedrxivClusteringS2S.v2, MrTidyRetrieval, MrTyDiJaRetrievalLite, MultiAlpaca, MultiCPRECom, MultiCPRMedical, NFCorpus, NFCorpus-Fa, NFCorpus-NL, NFCorpus-NL.v2, NFCorpus-PL, NFCorpus-VN, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoArguAnaRetrieval, NanoDBPedia-VN, NanoDBPediaRetrieval, NanoFEVER-VN, NanoFEVERRetrieval, NanoFiQA2018Retrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNFCorpusRetrieval, NanoNQ-VN, NanoNQRetrieval, NanoSciFactRetrieval, NaturalReasoning, NordicClassification, NordicRetrieval, NordicSTS, NordicTextMatching, OASST2, OCNLI, OpenCodeGeneticInstruct, OpenCodeReasoning2, OpenOrca, PAQ, PQuAD, ParSQuAD, ParaCrawl, PawsX, PersianQA, ProCQA, PubMedQA, QBQTC, QQP, RedditClustering, RedditClustering-VN, RedditClustering.v2, RedditClusteringP2P, RedditClusteringP2P-VN, RedditClusteringP2P.v2, RefGPT, RuInstruct, RuSentimentClustering, S2ORC, SIB200, SNLI, SPECTER, SQuAD, STS12, STS22, STS22.v2, STSBenchmark, STSBenchmark-VN, SciFact, SciFact-Fa, SciFact-Fa.v2, SciFact-NL, SciFact-NL.v2, SciFact-PL, SciFact-VN, SentenceCompression, SiberianDataset, SimCLUE, StackExchange, StackExchangeClustering, StackExchangeClustering-VN, StackExchangeClustering.v2, StackExchangeClusteringP2P, StackExchangeClusteringP2P-VN, StackExchangeClusteringP2P.v2, StackExchangeDupQuestions, StackOverflowDupQuestions, 
StackOverflowDupQuestions-VN, StackOverflowQA, SyntheticText2SQL, T2Ranking, THUCNews, TNews, TNews.v2, ToxicConversationsClassification, ToxicConversationsClassification.v2, ToxicConversationsVNClassification, TriviaQA, TweetSentimentExtractionClassification, TweetSentimentExtractionClassification.v2, TweetSentimentExtractionVNClassification, TwentyNewsgroupsClustering, TwentyNewsgroupsClustering-VN, TwentyNewsgroupsClustering.v2, UNPC, Waimai, Waimai.v2, WebFAQ, WikiOmnia, WildChat, XCodeEvalCodeToCode, XCodeEvalNLToCode, XCodeEvalTranslation, XNLI, XSum, YahooAnswers, cMedQAv2, webMedQA


Results for codefuse-ai/F2LLM-v2-4B

| task_name | codefuse-ai/F2LLM-v2-4B | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| MIRACLReranking | 0.6149 | nan | 0.6249 | 0.6851 | cl-nagoya/ruri-v3-310m | True |
| MIRACLRetrievalHardNegatives.v2 | 0.6320 | nan | 0.5333 | 0.7836 | jinaai/jina-embeddings-v5-text-small | True |
| MKQARetrieval | 0.3655 | nan | nan | 0.1835 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| MLSUMClusteringP2P | 0.7067 | nan | 0.4494 | 0.5156 | Salesforce/SFR-Embedding-2_R | False |
| MLSUMClusteringS2S | 0.7046 | nan | 0.455 | 0.5103 | Salesforce/SFR-Embedding-2_R | False |
| MTOPDomainClassification | 0.9924 | 0.9777 | 0.9024 | 0.9995 | voyageai/voyage-3-m-exp | True |
| MTOPIntentClassification | 0.9508 | nan | 0.672 | 0.9307 | BAAI/bge-multilingual-gemma2 | True |
| MintakaRetrieval | 0.5215 | nan | 0.3037 | 0.6253 | BAAI/bge-multilingual-gemma2 | False |
| MrTidyRetrieval | 0.6868 | nan | 0.6509 | 0.7737 | intfloat/multilingual-e5-large-instruct | True |
| MultiLongDocReranking | 0.9301 | nan | 0.8887 | 0.9338 | cl-nagoya/ruri-v3-310m | False |
| MultiLongDocRetrieval | 0.3903 | nan | 0.3175 | 0.4626 | cl-nagoya/ruri-v3-30m | False |
| SIB200Classification | 0.9658 | nan | 0.7339 | 0.8467 | tencent/KaLM-Embedding-Gemma3-12B-2511 | False |
| STS22 | 0.6711 | nan | 0.6365 | 0.8314 | OrdalieTech/Solon-embeddings-mini-beta-1.1 | True |
| STSBenchmarkMultilingualSTS | 0.8541 | nan | 0.842 | 0.9596 | Gameselo/STS-multilingual-mpnet-base-v2 | False |
| SpanishNewsClassification.v2 | 0.9242 | nan | 0.8862 | 0.8862 | intfloat/multilingual-e5-large | False |
| SpanishPassageRetrievalS2P | 0.4195 | nan | 0.4196 | 0.4402 | BAAI/bge-m3 | False |
| SpanishPassageRetrievalS2S | 0.7964 | nan | 0.7232 | 0.7516 | intfloat/e5-mistral-7b-instruct | False |
| SpanishSentimentClassification.v2 | 0.9650 | nan | 0.9241 | 0.9628 | intfloat/multilingual-e5-large-instruct | False |
| WebFAQRetrieval | 0.8576 | nan | 0.7596 | 0.7953 | jinaai/jina-embeddings-v3 | False |
| WisesightSentimentClassification.v2 | 0.4126 | nan | nan | 0.4010 | google/embeddinggemma-300m | False |
| XPQARetrieval | 0.5815 | nan | 0.434 | 0.7742 | BAAI/bge-multilingual-gemma2 | False |
| XQuADRetrieval | 0.9593 | nan | 0.9687 | 0.9709 | telepix/PIXIE-Rune-v1.0 | False |
| Average | 0.7229 | 0.9777 | 0.6563 | 0.7283 | nan | - |

The model has high performance on these tasks: SpanishSentimentClassification.v2, MTOPIntentClassification, SpanishNewsClassification.v2, SIB200Classification, WebFAQRetrieval, SpanishPassageRetrievalS2S, MLSUMClusteringP2P, MLSUMClusteringS2S, WisesightSentimentClassification.v2, MKQARetrieval

Training datasets: ANLI, AmazonCounterfactualClassification, AmazonCounterfactualVNClassification, AmazonPolarityClassification, AmazonPolarityClassification.v2, AmazonPolarityVNClassification, AmazonQA, AmazonReviewClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, ArguAna-Fa, ArguAna-Fa.v2, ArguAna-NL, ArguAna-NL.v2, ArguAna-PL, ArguAna-VN, ArxivClusteringP2P, ArxivClusteringP2P.v2, ArxivClusteringS2S, Aya, BQ, BactrianXLanguageClassification, BactrianXTranslation, Banking77Classification, Banking77Classification.v2, Banking77VNClassification, BioASQ, BiorxivClusteringP2P, BiorxivClusteringP2P.v2, BiorxivClusteringS2S, BiorxivClusteringS2S.v2, CEDR, CLIRMatrix, CMCQA, CMNLI, CNNDM, COIG, COLIEE, CORD19, CSL, CoLA, CodeFeedbackMT, CodeFeedbackST, CodeSearchNet, CodeSearchNetCCR, CosQA, DBPedia, DBPedia-Fa, DBPedia-NL, DBPedia-PL, DBPedia-PLHardNegatives, DBPedia-VN, DBPediaHardNegatives, DBPediaHardNegatives.v2, DuReader, ELI5, ESCI, EmotionClassification, EmotionClassification.v2, EmotionVNClassification, Europarl, FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, FEVERNL, FiQA-PL, FiQA2018, FiQA2018-Fa, FiQA2018-Fa.v2, FiQA2018-NL, FiQA2018-VN, GooAQ, HUMEArxivClusteringP2P, HUMEEmotionClassification, HUMERedditClusteringP2P, HUMESTS12, HUMESTS22, HUMESTSBenchmark, HUMEToxicConversationsClassification, HUMETweetSentimentExtractionClassification, HealthCareMagic, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, HotpotQANL, HuatuoEncQA, HuatuoKGQA, ImdbClassification, ImdbClassification.v2, ImdbVNClassification, InfinityInstruct, KoAlpaca, KoAlpacaRealQA, KoMagpie, LCQMC, LCSTS, LLMRetrievalData, Lawzhidao, M2Lingual, MEDI2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MKQA, MLDR, MLSUMClustering, 
MLSUMRetrieval, MMARCO, MNLI, MQA, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MSciNLI, MTOPDomainClassification, MTOPDomainVNClassification, MTOPIntentClassification, MTOPIntentVNClassification, MURI, MailruQA, MassiveIntentClassification, MassiveIntentVNClassification, MassiveScenarioClassification, MassiveScenarioVNClassification, MedInstruct, MedMCQA, MedQA, MedQuAD, MedicalFlashcards, MedicalInstruction, MedicalQARu, MedrxivClusteringP2P, MedrxivClusteringP2P.v2, MedrxivClusteringS2S, MedrxivClusteringS2S.v2, MrTidyRetrieval, MrTyDiJaRetrievalLite, MultiAlpaca, MultiCPRECom, MultiCPRMedical, NFCorpus, NFCorpus-Fa, NFCorpus-NL, NFCorpus-NL.v2, NFCorpus-PL, NFCorpus-VN, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoArguAnaRetrieval, NanoDBPedia-VN, NanoDBPediaRetrieval, NanoFEVER-VN, NanoFEVERRetrieval, NanoFiQA2018Retrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNFCorpusRetrieval, NanoNQ-VN, NanoNQRetrieval, NanoSciFactRetrieval, NaturalReasoning, NordicClassification, NordicRetrieval, NordicSTS, NordicTextMatching, OASST2, OCNLI, OpenCodeGeneticInstruct, OpenCodeReasoning2, OpenOrca, PAQ, PQuAD, ParSQuAD, ParaCrawl, PawsX, PersianQA, ProCQA, PubMedQA, QBQTC, QQP, RedditClustering, RedditClustering-VN, RedditClustering.v2, RedditClusteringP2P, RedditClusteringP2P-VN, RedditClusteringP2P.v2, RefGPT, RuInstruct, RuSentimentClustering, S2ORC, SIB200, SNLI, SPECTER, SQuAD, STS12, STS22, STS22.v2, STSBenchmark, STSBenchmark-VN, SciFact, SciFact-Fa, SciFact-Fa.v2, SciFact-NL, SciFact-NL.v2, SciFact-PL, SciFact-VN, SentenceCompression, SiberianDataset, SimCLUE, StackExchange, StackExchangeClustering, StackExchangeClustering-VN, StackExchangeClustering.v2, StackExchangeClusteringP2P, StackExchangeClusteringP2P-VN, `StackExchangeClus


Note: Content truncated due to GitHub API limits. See the full report in the workflow artifacts.

@Samoed
Member

Samoed commented Mar 31, 2026

Can you add prompts to your model meta?

@Geralt-Targaryen
Contributor Author

I've updated the model meta. Do I need to create a PR to add the new prompts to the model implementation as well?

@Samoed
Member

Samoed commented Mar 31, 2026

Do I need to create a PR to add the new prompts to the model implementation as well?

Yes
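For reference, the task-specific prompts from the PR description amount to a plain task-name-to-instruction mapping. A minimal sketch of how they could be kept alongside the model (the `model_prompts` name and the lookup helper are illustrative assumptions; the actual registration mechanism lives in `mteb/models/model_implementations/`):

```python
# Task-name -> instruction mapping for the newly evaluated tasks
# (prompts copied verbatim from the PR description; the field name
# `model_prompts` is an assumption, not a confirmed mteb API).
model_prompts = {
    "MKQARetrieval": "Given a question, retrieve the answer.",
    "WisesightSentimentClassification.v2": "Classify the sentiment of the given text.",
    "SpanishSentimentClassification.v2": "Classify the sentiment of the given text.",
}

def prompt_for(task_name, default=""):
    """Look up the instruction for a task, falling back to a default."""
    return model_prompts.get(task_name, default)
```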

@KennethEnevoldsen KennethEnevoldsen added the `waiting for review of implementation` label (This PR is waiting for an implementation review before merging the results.) Apr 5, 2026
@KennethEnevoldsen
Contributor

For certain tasks that have existing results on other languages (e.g., MIRACLReranking), when I evaluated the models on the Thai benchmark, the existing results were erased. But then I evaluated the models on the Spanish benchmark, and the Thai results were not erased. I presume this is probably because different MTEB versions were used (existing results use v2.6.7; now I'm using v2.12.4). Currently I solved this issue by manually merging the new results into the existing file and setting the mteb_version field to the latter one. However, I think it would be better to automatically resolve this conflict.

You should not automatically merge these - they were run using different MTEB versions and therefore shouldn't be merged (it makes reproducing the results potentially impossible).

A solution right now is to rerun the full set with your current version, or to downgrade the version.

I do, however, agree that this is not ideal; we would need to move the MTEB version to the individual subsets to avoid this issue. I would be more than happy to see an issue on this outlining the problem.
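The version conflict described above can at least be detected before any results are overwritten. A hedged sketch of a guard that refuses to combine two result dicts whose `mteb_version` fields differ (the `scores` layout is an illustrative assumption; only the `mteb_version` field is mentioned in this thread):

```python
def merge_results(existing, new):
    """Merge per-task scores from two result dicts, refusing to mix
    results produced by different MTEB versions (mixing them would make
    the merged file impossible to reproduce with a single version)."""
    if existing["mteb_version"] != new["mteb_version"]:
        raise ValueError(
            f"refusing to merge: existing results use mteb "
            f"{existing['mteb_version']}, new results use {new['mteb_version']}"
        )
    merged = dict(existing)
    # New task scores win on key collisions; untouched tasks are kept.
    merged["scores"] = {**existing.get("scores", {}), **new.get("scores", {})}
    return merged
```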

@Geralt-Targaryen
Contributor Author

I've created an issue. I'll rerun the models on all subsets.

The prompts have been added to the implementation btw (embeddings-benchmark/mteb#4336)

@Geralt-Targaryen
Contributor Author

Rerunning the full set would take a really long time for some tasks. So I've rerun the Thai and Spanish subsets using the older version instead. The results are identical anyway. @KennethEnevoldsen

@github-actions

This pull request has been automatically marked as stale due to inactivity.

@github-actions github-actions Bot added the stale label Apr 22, 2026
@Geralt-Targaryen Geralt-Targaryen mentioned this pull request Apr 22, 2026
@github-actions github-actions Bot removed the stale label Apr 23, 2026
@KennethEnevoldsen KennethEnevoldsen merged commit 6655497 into embeddings-benchmark:main Apr 23, 2026
3 checks passed