
rerun old results of multilingual-e5 small, medium, large#474

Merged
KennethEnevoldsen merged 3 commits into main from remove_old_results on Apr 19, 2026

Conversation

@Samoed Samoed (Member) commented Apr 2, 2026

Ref embeddings-benchmark/mteb#3921

rerun old results of multilingual-e5 small, medium, large

Checklist

  • My model has a model sheet, report, or similar
  • My model has a reference implementation in mteb/models/model_implementations/; this can be an API. Instructions on how to add a model can be found here
    • No, but there is an existing PR ___
  • The results submitted are obtained using the reference implementation
  • My model is available, either as a publicly accessible API or publicly on e.g., Huggingface
  • I solemnly swear that for all results submitted I have not trained on the evaluation dataset, including training splits. If I have, I have disclosed it clearly.

github-actions (bot) commented Apr 2, 2026

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: BAAI/bge-m3, intfloat/e5-base, intfloat/multilingual-e5-base, intfloat/multilingual-e5-large, intfloat/multilingual-e5-small
Tasks: ArxivClusteringP2P, ArxivClusteringS2S, BUCC, CLSClusteringP2P, CLSClusteringS2S, CMedQAv1-reranking, CMedQAv2-reranking, Cmnli, DBPedia-PL, DKHateClassification, DiaBlaBitextMining, EightTagsClustering, FEVER, HotpotQA-PL, LEMBNarrativeQARetrieval, LEMBNeedleRetrieval, LEMBQMSumRetrieval, LEMBSummScreenFDRetrieval, LEMBWikimQARetrieval, MSMARCO-PL, NQ-PL, NorwegianParliamentClassification, Ocnli, PpcPC, QBQTC, Quora-PL, ScalaClassification, ThuNewsClusteringP2P, ThuNewsClusteringS2S

Results for BAAI/bge-m3

| task_name | BAAI/bge-m3 | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|
| LEMBNarrativeQARetrieval | 0.4879 | 0.2422 | 0.7690 | lightonai/GTE-ModernColBERT-v1 | False |
| LEMBNeedleRetrieval | 0.3950 | 0.2800 | 0.9325 | lightonai/GTE-ModernColBERT-v1 | False |
| LEMBQMSumRetrieval | 0.3611 | 0.2426 | 0.8323 | mteb/baseline-bm25s | False |
| LEMBSummScreenFDRetrieval | 0.9389 | 0.7112 | 0.9784 | mteb/baseline-bm25s | False |
| LEMBWikimQARetrieval | 0.7946 | 0.5680 | 0.9988 | lightonai/GTE-ModernColBERT-v1 | False |
| Average | 0.5955 | 0.4088 | 0.9022 | nan | - |

Training datasets: CMedQAv1-reranking, CMedQAv2-reranking, CmedqaRetrieval, CodeSearchNet, DuRetrieval, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, LeCaRDv2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MMarcoReranking, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MrTidyRetrieval, MrTyDiJaRetrievalLite, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, T2Reranking, T2Retrieval, mMARCO-NL


Results for intfloat/e5-base

| task_name | google/gemini-embedding-001 | intfloat/e5-base | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| DKHateClassification | 0.8702 | 0.5863 | nan | 0.8702 | google/gemini-embedding-001 | False |
| NorwegianParliamentClassification | 0.5672 | 0.5593 | 0.5606 | 0.7007 | Qwen/Qwen3-Embedding-8B | False |
| ScalaClassification | 0.5185 | 0.4991 | 0.5109 | 0.9112 | microsoft/harrier-oss-v1-27b | False |
| Average | 0.6520 | 0.5482 | 0.5357 | 0.8274 | nan | - |

Training datasets: MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, mMARCO-NL


Results for intfloat/multilingual-e5-base

| task_name | google/gemini-embedding-001 | intfloat/multilingual-e5-base | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| ArxivClusteringP2P | nan | 0.4376 | 0.4473 | 0.6092 | TencentBAC/Conan-embedding-v2 | False |
| ArxivClusteringS2S | nan | 0.3631 | 0.3871 | 0.5520 | TencentBAC/Conan-embedding-v2 | False |
| BUCC | nan | 0.9818 | 0.9872 | 0.9900 | intfloat/multilingual-e5-large-instruct | False |
| CLSClusteringP2P | nan | 0.3860 | 0.3915 | 0.8225 | Bytedance/Seed1.6-embedding | False |
| CLSClusteringS2S | nan | 0.3686 | 0.3682 | 0.7627 | tencent/Youtu-Embedding | False |
| CMedQAv1-reranking | nan | 0.6646 | 0.6765 | 0.9434 | Kingsoft-LLM/QZhou-Embedding | False |
| CMedQAv2-reranking | nan | 0.6626 | 0.6676 | 0.9444 | IEITYuan/Yuan-embedding-2.0-zh | False |
| Cmnli | nan | 0.6735 | 0.6983 | 0.9579 | Kingsoft-LLM/QZhou-Embedding-Zh | False |
| DBPedia-PL | nan | 0.3014 | 0.3578 | 0.4319 | BAAI/bge-multilingual-gemma2 | False |
| DiaBlaBitextMining | 0.8723 | 0.8470 | 0.8483 | 0.8882 | codefuse-ai/F2LLM-v2-14B | False |
| EightTagsClustering | nan | 0.2235 | 0.2652 | 0.5029 | BAAI/bge-multilingual-gemma2 | False |
| FEVER | nan | 0.7942 | 0.8281 | 0.9628 | voyageai/voyage-3-m-exp | True |
| HotpotQA-PL | nan | 0.6347 | 0.6741 | 0.7703 | BAAI/bge-multilingual-gemma2 | True |
| MSMARCO-PL | nan | 0.6198 | 0.6991 | 0.7269 | BAAI/bge-multilingual-gemma2 | True |
| NQ-PL | nan | 0.4479 | 0.5282 | 0.5685 | BAAI/bge-multilingual-gemma2 | True |
| NorwegianParliamentClassification | 0.5672 | 0.5609 | 0.5606 | 0.7007 | Qwen/Qwen3-Embedding-8B | False |
| Ocnli | nan | 0.5901 | 0.5864 | 0.9518 | Kingsoft-LLM/QZhou-Embedding-Zh | False |
| PpcPC | 0.9550 | 0.8830 | 0.9116 | 0.9576 | microsoft/harrier-oss-v1-27b | False |
| QBQTC | nan | 0.2831 | 0.2747 | 0.6156 | Kingsoft-LLM/QZhou-Embedding-Zh | False |
| Quora-PL | nan | 0.8131 | 0.8370 | 0.8563 | sdadas/mmlw-e5-large | False |
| ScalaClassification | 0.5185 | 0.5089 | 0.5109 | 0.9112 | microsoft/harrier-oss-v1-27b | False |
| ThuNewsClusteringP2P | nan | 0.5318 | 0.5593 | 0.8976 | Kingsoft-LLM/QZhou-Embedding-Zh | False |
| ThuNewsClusteringS2S | nan | 0.5371 | 0.5456 | 0.8955 | tencent/Youtu-Embedding | False |
| Average | 0.7282 | 0.5702 | 0.5918 | 0.7922 | nan | - |

Training datasets: FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-PL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MrTidyRetrieval, MrTyDiJaRetrievalLite, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoFEVER-VN, NanoFEVERRetrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, XQuADRetrieval, mMARCO-NL


Results for intfloat/multilingual-e5-large

Revisions compared: ab10c1a7f42e74530fe7ae5be82e6d4f11a719eb vs. external

| task_name | google/gemini-embedding-001 | intfloat/multilingual-e5-large (ab10c1a) | intfloat/multilingual-e5-large (external) | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| ArxivClusteringP2P | nan | 0.4473 | 0.4431 | 0.6092 | TencentBAC/Conan-embedding-v2 | False |
| ArxivClusteringS2S | nan | 0.3871 | 0.3843 | 0.5520 | TencentBAC/Conan-embedding-v2 | False |
| BUCC | nan | 0.9872 | 0.9855 | 0.9900 | intfloat/multilingual-e5-large-instruct | False |
| CLSClusteringP2P | nan | 0.3915 | nan | 0.8225 | Bytedance/Seed1.6-embedding | False |
| CLSClusteringS2S | nan | 0.3682 | nan | 0.7627 | tencent/Youtu-Embedding | False |
| CMedQAv1-reranking | nan | 0.6765 | nan | 0.9434 | Kingsoft-LLM/QZhou-Embedding | False |
| CMedQAv2-reranking | nan | 0.6676 | nan | 0.9444 | IEITYuan/Yuan-embedding-2.0-zh | False |
| Cmnli | nan | 0.6983 | nan | 0.9579 | Kingsoft-LLM/QZhou-Embedding-Zh | False |
| DBPedia-PL | nan | 0.3578 | nan | 0.4319 | BAAI/bge-multilingual-gemma2 | False |
| DiaBlaBitextMining | 0.8723 | 0.8483 | nan | 0.8882 | codefuse-ai/F2LLM-v2-14B | False |
| EightTagsClustering | nan | 0.2652 | nan | 0.5029 | BAAI/bge-multilingual-gemma2 | False |
| FEVER | nan | 0.8279 | 0.8281 | 0.9628 | voyageai/voyage-3-m-exp | True |
| HotpotQA-PL | nan | 0.6741 | nan | 0.7703 | BAAI/bge-multilingual-gemma2 | True |
| MSMARCO-PL | nan | 0.6991 | nan | 0.7269 | BAAI/bge-multilingual-gemma2 | True |
| NQ-PL | nan | 0.5282 | nan | 0.5685 | BAAI/bge-multilingual-gemma2 | True |
| NorwegianParliamentClassification | 0.5672 | 0.5606 | nan | 0.7007 | Qwen/Qwen3-Embedding-8B | False |
| Ocnli | nan | 0.5864 | nan | 0.9518 | Kingsoft-LLM/QZhou-Embedding-Zh | False |
| PpcPC | 0.9550 | 0.9116 | nan | 0.9576 | microsoft/harrier-oss-v1-27b | False |
| QBQTC | nan | 0.2747 | nan | 0.6156 | Kingsoft-LLM/QZhou-Embedding-Zh | False |
| Quora-PL | nan | 0.8370 | nan | 0.8563 | sdadas/mmlw-e5-large | False |
| ScalaClassification | 0.5185 | 0.5109 | nan | 0.9112 | microsoft/harrier-oss-v1-27b | False |
| ThuNewsClusteringP2P | nan | 0.5593 | nan | 0.8976 | Kingsoft-LLM/QZhou-Embedding-Zh | False |
| ThuNewsClusteringS2S | nan | 0.5456 | nan | 0.8955 | tencent/Youtu-Embedding | False |
| Average | 0.7282 | 0.5918 | 0.6602 | 0.7922 | nan | - |

Training datasets: FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-PL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MrTidyRetrieval, MrTyDiJaRetrievalLite, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoFEVER-VN, NanoFEVERRetrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, XQuADRetrieval, mMARCO-NL


Results for intfloat/multilingual-e5-small

| task_name | google/gemini-embedding-001 | intfloat/multilingual-e5-large | intfloat/multilingual-e5-small | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| CMedQAv1-reranking | nan | 0.6765 | 0.6468 | 0.9434 | Kingsoft-LLM/QZhou-Embedding | False |
| CMedQAv2-reranking | nan | 0.6676 | 0.6404 | 0.9444 | IEITYuan/Yuan-embedding-2.0-zh | False |
| DBPedia-PL | nan | 0.3578 | 0.2927 | 0.4319 | BAAI/bge-multilingual-gemma2 | False |
| DiaBlaBitextMining | 0.8723 | 0.8483 | 0.8192 | 0.8882 | codefuse-ai/F2LLM-v2-14B | False |
| EightTagsClustering | nan | 0.2652 | 0.2090 | 0.5029 | BAAI/bge-multilingual-gemma2 | False |
| HotpotQA-PL | nan | 0.6741 | 0.6016 | 0.7703 | BAAI/bge-multilingual-gemma2 | True |
| MSMARCO-PL | nan | 0.6991 | 0.5772 | 0.7269 | BAAI/bge-multilingual-gemma2 | True |
| NQ-PL | nan | 0.5282 | 0.4045 | 0.5685 | BAAI/bge-multilingual-gemma2 | True |
| NorwegianParliamentClassification | 0.5672 | 0.5606 | 0.5657 | 0.7007 | Qwen/Qwen3-Embedding-8B | False |
| PpcPC | 0.9550 | 0.9116 | 0.8774 | 0.9576 | microsoft/harrier-oss-v1-27b | False |
| Quora-PL | nan | 0.8370 | 0.7870 | 0.8563 | sdadas/mmlw-e5-large | False |
| ScalaClassification | 0.5185 | 0.5109 | 0.5041 | 0.9112 | microsoft/harrier-oss-v1-27b | False |
| Average | 0.7282 | 0.6281 | 0.5771 | 0.7669 | nan | - |

Training datasets: FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-PL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MrTidyRetrieval, MrTyDiJaRetrievalLite, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoFEVER-VN, NanoFEVERRetrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, XQuADRetrieval, mMARCO-NL



Note: Content truncated due to GitHub API limits. See the full report in the workflow artifacts.

@KennethEnevoldsen KennethEnevoldsen changed the title from "Remove old results" to "rerun old results of multilingual-e5 small, medium, large" on Apr 5, 2026
@KennethEnevoldsen (Contributor)

A few of the results are notably lower, though not by much. I am unsure whether it is worth looking too much into it.

@KennethEnevoldsen (Contributor)

@Samoed should we just merge these or should we look into the scores?

@Samoed Samoed (Member, Author) commented Apr 16, 2026

Which results are lower? I tried to search for such tasks but couldn't find any.

@KennethEnevoldsen (Contributor)

It is the reason why the tests fail:

```
FAILED tests/test_results_diff.py::test_result_diffs_within_threshold - AssertionError: Main score changes exceed configured threshold (MTEB_SCORE_EPSILON=0.001):
    intfloat/multilingual-e5-base/ArxivClusteringP2P: The difference between the current score (0.437572) and the previous (0.43349302457921374) exceeds threshold of 0.001
    intfloat/multilingual-e5-base/ArxivClusteringS2S: The difference between the current score (0.363116) and the previous (0.3599685331336122) exceeds threshold of 0.001
    intfloat/multilingual-e5-base/DiaBlaBitextMining: The difference between the current score (0.846981) and the previous (0.8346846148516296) exceeds threshold of 0.001
    intfloat/multilingual-e5-base/DiaBlaBitextMining: The difference between the current score (0.846981) and the previous (0.8346846148516296) exceeds threshold of 0.001
    intfloat/multilingual-e5-base/NorwegianParliamentClassification: The difference between the current score (0.56225) and the previous (0.5585000000000001) exceeds threshold of 0.001
    intfloat/multilingual-e5-base/NorwegianParliamentClassification: The difference between the current score (0.5595) and the previous (0.5548333333333333) exceeds threshold of 0.001
    intfloat/multilingual-e5-base/PpcPC: The difference between the current score (0.882969) and the previous (0.8802438485182167) exceeds threshold of 0.001
    intfloat/multilingual-e5-base/ScalaClassification: The difference between the current score (0.506885) and the previous (0.508544921875) exceeds threshold of 0.001
    intfloat/multilingual-e5-large/NorwegianParliamentClassification: The difference between the current score (0.561833) and the previous (0.563167) exceeds threshold of 0.001
    intfloat/multilingual-e5-small/CMedQAv2-reranking: The difference between the current score (0.6404) and the previous (0.6423767434864328) exceeds threshold of 0.001
```

In general these are small differences, but they are larger than the threshold.
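For context, the failing check amounts to comparing each task's new main score against the stored one with a fixed epsilon. The sketch below is an illustrative reconstruction, not the actual mteb test code; the names `EPSILON` and `check_score_diffs` are made up for the example.

```python
EPSILON = 0.001  # mirrors MTEB_SCORE_EPSILON from the failure log above


def check_score_diffs(current: dict, previous: dict, epsilon: float = EPSILON) -> list[str]:
    """Return a message for every task whose main score moved more than epsilon."""
    failures = []
    for task, new_score in current.items():
        old_score = previous.get(task)
        if old_score is not None and abs(new_score - old_score) > epsilon:
            failures.append(
                f"{task}: current {new_score:.6f} vs previous {old_score:.6f} "
                f"exceeds threshold of {epsilon}"
            )
    return failures
```

Any non-empty return would then trip the assertion, which is why even a ~0.004 drift on ArxivClusteringP2P fails the run.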

@Samoed Samoed (Member, Author) commented Apr 17, 2026

Ah, I forgot about the tests. I think the only big difference is in DiaBlaBitextMining, but I have no idea why there is such a big gap.

@KennethEnevoldsen (Contributor)

Hmm, I suspect it is the change of the implementation to sentence-transformers (though it may be worth looking into why there is a difference?). It seems like it generally gets a bit worse.

@Samoed Samoed (Member, Author) commented Apr 17, 2026

Most of the affected tasks are classification and clustering. We changed the random seeds for them (fixed them; previously they were random), and this can cause differences.
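The seed effect described above can be shown with a toy stand-in: a metric whose value depends on random initialization (as KMeans-style clustering scores do) is only reproducible once the seed is pinned. This is pure illustration; `noisy_metric` is hypothetical and not part of mteb.

```python
import random


def noisy_metric(seed=None):
    # Hypothetical stand-in for a clustering score whose value depends on
    # random initialization; seed=None draws from system entropy.
    rng = random.Random(seed)
    return round(0.55 + rng.uniform(-0.01, 0.01), 4)


# With a pinned seed the score is reproducible across runs; without one,
# each evaluation can land on a slightly different value, which is enough
# to exceed a 0.001 regression threshold.
assert noisy_metric(seed=42) == noisy_metric(seed=42)
```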

@KennethEnevoldsen (Contributor)

I will assume it is either the seed or the change in the model implementation. Either way, these results are now reproducible, so I will merge.

@KennethEnevoldsen KennethEnevoldsen merged commit 6b80cc6 into main Apr 19, 2026
2 of 3 checks passed
@Samoed Samoed mentioned this pull request Apr 19, 2026
@Samoed Samoed deleted the remove_old_results branch May 6, 2026 18:15