rerun old results of multilingual-e5 small, medium, large #474
KennethEnevoldsen merged 3 commits into main from
Conversation
Model Results Comparison

Reference models: Results for
| task_name | BAAI/bge-m3 | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|
| LEMBNarrativeQARetrieval | 0.4879 | 0.2422 | 0.7690 | lightonai/GTE-ModernColBERT-v1 | False |
| LEMBNeedleRetrieval | 0.395 | 0.28 | 0.9325 | lightonai/GTE-ModernColBERT-v1 | False |
| LEMBQMSumRetrieval | 0.3611 | 0.2426 | 0.8323 | mteb/baseline-bm25s | False |
| LEMBSummScreenFDRetrieval | 0.9389 | 0.7112 | 0.9784 | mteb/baseline-bm25s | False |
| LEMBWikimQARetrieval | 0.7946 | 0.568 | 0.9988 | lightonai/GTE-ModernColBERT-v1 | False |
| Average | 0.5955 | 0.4088 | 0.9022 | nan | - |
Training datasets: CMedQAv1-reranking, CMedQAv2-reranking, CmedqaRetrieval, CodeSearchNet, DuRetrieval, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, LeCaRDv2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MMarcoReranking, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MrTidyRetrieval, MrTyDiJaRetrievalLite, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, T2Reranking, T2Retrieval, mMARCO-NL
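The Average rows in these tables appear to be unweighted means of the per-task scores; a quick check against the BAAI/bge-m3 column above:

```python
# Verify the "Average" row of the BAAI/bge-m3 column from the first table
bge_m3 = {
    "LEMBNarrativeQARetrieval": 0.4879,
    "LEMBNeedleRetrieval": 0.395,
    "LEMBQMSumRetrieval": 0.3611,
    "LEMBSummScreenFDRetrieval": 0.9389,
    "LEMBWikimQARetrieval": 0.7946,
}

# Unweighted mean over the five tasks, rounded to match the report
avg = round(sum(bge_m3.values()) / len(bge_m3), 4)
assert avg == 0.5955  # matches the Average row above
```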
Results for intfloat/e5-base
| task_name | google/gemini-embedding-001 | intfloat/e5-base | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| DKHateClassification | 0.8702 | 0.5863 | nan | 0.8702 | google/gemini-embedding-001 | False |
| NorwegianParliamentClassification | 0.5672 | 0.5593 | 0.5606 | 0.7007 | Qwen/Qwen3-Embedding-8B | False |
| ScalaClassification | 0.5185 | 0.4991 | 0.5109 | 0.9112 | microsoft/harrier-oss-v1-27b | False |
| Average | 0.6520 | 0.5482 | 0.5357 | 0.8274 | nan | - |
Training datasets: MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, mMARCO-NL
Results for intfloat/multilingual-e5-base
| task_name | google/gemini-embedding-001 | intfloat/multilingual-e5-base | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| ArxivClusteringP2P | nan | 0.4376 | 0.4473 | 0.6092 | TencentBAC/Conan-embedding-v2 | False |
| ArxivClusteringS2S | nan | 0.3631 | 0.3871 | 0.5520 | TencentBAC/Conan-embedding-v2 | False |
| BUCC | nan | 0.9818 | 0.9872 | 0.9900 | intfloat/multilingual-e5-large-instruct | False |
| CLSClusteringP2P | nan | 0.386 | 0.3915 | 0.8225 | Bytedance/Seed1.6-embedding | False |
| CLSClusteringS2S | nan | 0.3686 | 0.3682 | 0.7627 | tencent/Youtu-Embedding | False |
| CMedQAv1-reranking | nan | 0.6646 | 0.6765 | 0.9434 | Kingsoft-LLM/QZhou-Embedding | False |
| CMedQAv2-reranking | nan | 0.6626 | 0.6676 | 0.9444 | IEITYuan/Yuan-embedding-2.0-zh | False |
| Cmnli | nan | 0.6735 | 0.6983 | 0.9579 | Kingsoft-LLM/QZhou-Embedding-Zh | False |
| DBPedia-PL | nan | 0.3014 | 0.3578 | 0.4319 | BAAI/bge-multilingual-gemma2 | False |
| DiaBlaBitextMining | 0.8723 | 0.847 | 0.8483 | 0.8882 | codefuse-ai/F2LLM-v2-14B | False |
| EightTagsClustering | nan | 0.2235 | 0.2652 | 0.5029 | BAAI/bge-multilingual-gemma2 | False |
| FEVER | nan | 0.7942 | 0.8281 | 0.9628 | voyageai/voyage-3-m-exp | True |
| HotpotQA-PL | nan | 0.6347 | 0.6741 | 0.7703 | BAAI/bge-multilingual-gemma2 | True |
| MSMARCO-PL | nan | 0.6198 | 0.6991 | 0.7269 | BAAI/bge-multilingual-gemma2 | True |
| NQ-PL | nan | 0.4479 | 0.5282 | 0.5685 | BAAI/bge-multilingual-gemma2 | True |
| NorwegianParliamentClassification | 0.5672 | 0.5609 | 0.5606 | 0.7007 | Qwen/Qwen3-Embedding-8B | False |
| Ocnli | nan | 0.5901 | 0.5864 | 0.9518 | Kingsoft-LLM/QZhou-Embedding-Zh | False |
| PpcPC | 0.955 | 0.883 | 0.9116 | 0.9576 | microsoft/harrier-oss-v1-27b | False |
| QBQTC | nan | 0.2831 | 0.2747 | 0.6156 | Kingsoft-LLM/QZhou-Embedding-Zh | False |
| Quora-PL | nan | 0.8131 | 0.837 | 0.8563 | sdadas/mmlw-e5-large | False |
| ScalaClassification | 0.5185 | 0.5089 | 0.5109 | 0.9112 | microsoft/harrier-oss-v1-27b | False |
| ThuNewsClusteringP2P | nan | 0.5318 | 0.5593 | 0.8976 | Kingsoft-LLM/QZhou-Embedding-Zh | False |
| ThuNewsClusteringS2S | nan | 0.5371 | 0.5456 | 0.8955 | tencent/Youtu-Embedding | False |
| Average | 0.7282 | 0.5702 | 0.5918 | 0.7922 | nan | - |
Training datasets: FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-PL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MrTidyRetrieval, MrTyDiJaRetrievalLite, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoFEVER-VN, NanoFEVERRetrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, XQuADRetrieval, mMARCO-NL
Results for intfloat/multilingual-e5-large
| task_name | google/gemini-embedding-001 | intfloat/multilingual-e5-large | intfloat/multilingual-e5-large | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| Revisions | ab10c1a7f42e74530fe7ae5be82e6d4f11a719eb | external | | | | |
| ArxivClusteringP2P | nan | 0.4473 | 0.4431 | 0.6092 | TencentBAC/Conan-embedding-v2 | False |
| ArxivClusteringS2S | nan | 0.3871 | 0.3843 | 0.5520 | TencentBAC/Conan-embedding-v2 | False |
| BUCC | nan | 0.9872 | 0.9855 | 0.9900 | intfloat/multilingual-e5-large-instruct | False |
| CLSClusteringP2P | nan | 0.3915 | nan | 0.8225 | Bytedance/Seed1.6-embedding | False |
| CLSClusteringS2S | nan | 0.3682 | nan | 0.7627 | tencent/Youtu-Embedding | False |
| CMedQAv1-reranking | nan | 0.6765 | nan | 0.9434 | Kingsoft-LLM/QZhou-Embedding | False |
| CMedQAv2-reranking | nan | 0.6676 | nan | 0.9444 | IEITYuan/Yuan-embedding-2.0-zh | False |
| Cmnli | nan | 0.6983 | nan | 0.9579 | Kingsoft-LLM/QZhou-Embedding-Zh | False |
| DBPedia-PL | nan | 0.3578 | nan | 0.4319 | BAAI/bge-multilingual-gemma2 | False |
| DiaBlaBitextMining | 0.8723 | 0.8483 | nan | 0.8882 | codefuse-ai/F2LLM-v2-14B | False |
| EightTagsClustering | nan | 0.2652 | nan | 0.5029 | BAAI/bge-multilingual-gemma2 | False |
| FEVER | nan | 0.8279 | 0.8281 | 0.9628 | voyageai/voyage-3-m-exp | True |
| HotpotQA-PL | nan | 0.6741 | nan | 0.7703 | BAAI/bge-multilingual-gemma2 | True |
| MSMARCO-PL | nan | 0.6991 | nan | 0.7269 | BAAI/bge-multilingual-gemma2 | True |
| NQ-PL | nan | 0.5282 | nan | 0.5685 | BAAI/bge-multilingual-gemma2 | True |
| NorwegianParliamentClassification | 0.5672 | 0.5606 | nan | 0.7007 | Qwen/Qwen3-Embedding-8B | False |
| Ocnli | nan | 0.5864 | nan | 0.9518 | Kingsoft-LLM/QZhou-Embedding-Zh | False |
| PpcPC | 0.955 | 0.9116 | nan | 0.9576 | microsoft/harrier-oss-v1-27b | False |
| QBQTC | nan | 0.2747 | nan | 0.6156 | Kingsoft-LLM/QZhou-Embedding-Zh | False |
| Quora-PL | nan | 0.8370 | nan | 0.8563 | sdadas/mmlw-e5-large | False |
| ScalaClassification | 0.5185 | 0.5109 | nan | 0.9112 | microsoft/harrier-oss-v1-27b | False |
| ThuNewsClusteringP2P | nan | 0.5593 | nan | 0.8976 | Kingsoft-LLM/QZhou-Embedding-Zh | False |
| ThuNewsClusteringS2S | nan | 0.5456 | nan | 0.8955 | tencent/Youtu-Embedding | False |
| Average | 0.7282 | 0.5918 | 0.6602 | 0.7922 | nan | - |
Training datasets: FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-PL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MrTidyRetrieval, MrTyDiJaRetrievalLite, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoFEVER-VN, NanoFEVERRetrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, XQuADRetrieval, mMARCO-NL
Results for intfloat/multilingual-e5-small
| task_name | google/gemini-embedding-001 | intfloat/multilingual-e5-large | intfloat/multilingual-e5-small | Max result | Model with max result | In Training Data |
|---|---|---|---|---|---|---|
| CMedQAv1-reranking | nan | 0.6765 | 0.6468 | 0.9434 | Kingsoft-LLM/QZhou-Embedding | False |
| CMedQAv2-reranking | nan | 0.6676 | 0.6404 | 0.9444 | IEITYuan/Yuan-embedding-2.0-zh | False |
| DBPedia-PL | nan | 0.3578 | 0.2927 | 0.4319 | BAAI/bge-multilingual-gemma2 | False |
| DiaBlaBitextMining | 0.8723 | 0.8483 | 0.8192 | 0.8882 | codefuse-ai/F2LLM-v2-14B | False |
| EightTagsClustering | nan | 0.2652 | 0.209 | 0.5029 | BAAI/bge-multilingual-gemma2 | False |
| HotpotQA-PL | nan | 0.6741 | 0.6016 | 0.7703 | BAAI/bge-multilingual-gemma2 | True |
| MSMARCO-PL | nan | 0.6991 | 0.5772 | 0.7269 | BAAI/bge-multilingual-gemma2 | True |
| NQ-PL | nan | 0.5282 | 0.4045 | 0.5685 | BAAI/bge-multilingual-gemma2 | True |
| NorwegianParliamentClassification | 0.5672 | 0.5606 | 0.5657 | 0.7007 | Qwen/Qwen3-Embedding-8B | False |
| PpcPC | 0.955 | 0.9116 | 0.8774 | 0.9576 | microsoft/harrier-oss-v1-27b | False |
| Quora-PL | nan | 0.837 | 0.787 | 0.8563 | sdadas/mmlw-e5-large | False |
| ScalaClassification | 0.5185 | 0.5109 | 0.5041 | 0.9112 | microsoft/harrier-oss-v1-27b | False |
| Average | 0.7282 | 0.6281 | 0.5771 | 0.7669 | nan | - |
Training datasets: FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-PL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MrTidyRetrieval, MrTyDiJaRetrievalLite, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoFEVER-VN, NanoFEVERRetrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNQ-VN, NanoNQRetrieval, XQuADRetrieval, mMARCO-NL
Note: Content truncated due to GitHub API limits. See the full report in the workflow artifacts.
A few of the results are notably lower, though not by a lot. I am unsure whether it is worth looking too much into it.
@Samoed, should we just merge these, or should we look into the scores?
Which results are lower? I tried to search for such tasks, but can't find
That is the reason the tests fail: in general the differences are small, but they are larger than the threshold.
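Conceptually the failing check is just a tolerance comparison between the historical and rerun scores. A minimal sketch (the function name and threshold value are illustrative, not mteb's actual ones):

```python
import math

def scores_match(old: float, new: float, abs_tol: float = 0.01) -> bool:
    """Return True when a rerun score is within the allowed absolute
    tolerance of the historical score (threshold value is illustrative)."""
    return math.isclose(old, new, abs_tol=abs_tol)

# A tiny drift passes, e.g. FEVER: 0.8281 (external) vs 0.8279 (rerun)
assert scores_match(0.8281, 0.8279)

# A larger gap, like the e5-large averages above, would trip the test
assert not scores_match(0.6602, 0.5918)
```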
Ah, I forgot about the tests. I think the only big difference is in
Hmm, I suspect it is the change of implementation to SentenceTransformers (though it may be worth looking into why there is a difference). It seems like the model generally scores a bit worse.
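As a toy illustration of how an implementation detail can shift embeddings (and hence downstream scores), consider pooling: the same token embeddings produce different sentence vectors under mean pooling vs first-token pooling. The numbers here are made up and this is not a claim about what actually changed in the e5 wrapper:

```python
# Hypothetical per-token embeddings for a 3-token sentence, dim=2
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Mean pooling: average each dimension across tokens
mean_pooled = [sum(dim) / len(tokens) for dim in zip(*tokens)]

# First-token ([CLS]-style) pooling: take the first token's vector
cls_pooled = tokens[0]

# The two strategies disagree, so any swap between them moves scores
assert mean_pooled == [2 / 3, 2 / 3]
assert cls_pooled != mean_pooled
```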
Most of the affected tasks are classification and clustering. We changed the random seeds for them (fixed them; previously they were random), and this can cause differences.
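A minimal sketch of why fixing the seed both changes and stabilizes such scores (the metric here is a toy stand-in, not an mteb task):

```python
import random

def noisy_cluster_score(seed=None) -> float:
    """Toy stand-in for a clustering metric whose value depends on
    random initialization (e.g. k-means centroid seeding)."""
    rng = random.Random(seed)
    # Pretend the metric lands somewhere in [0.35, 0.40] depending on init
    return round(0.35 + 0.05 * rng.random(), 4)

# With a fixed seed, reruns are reproducible run-to-run
assert noisy_cluster_score(seed=42) == noisy_cluster_score(seed=42)

# Unseeded runs draw fresh initializations each time, so historical
# scores produced without a fixed seed need not match the rerun exactly
```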
I will assume it is either the seed change or the change in the model implementation. Either way, these results are now reproducible, so I will merge.
Ref embeddings-benchmark/mteb#3921
Checklist

- I have added the model implementation under mteb/models/model_implementations/ (this can be as an API). Instructions on how to add a model can be found here.