Topic: llms-benchmarking

👇 Here are 27 public repositories matching this topic...

aflah02 / humans-v-s-llm-benchmarks

LLM benchmarks play a crucial role in assessing the performance of Large Language Models (LLMs), but they have their own limitations. This interactive tool engages users in a quiz game based on popular LLM benchmarks, offering an insightful way to explore and understand them.

User: aflah02

Home Page: https://play-with-llm-benchmarks.streamlit.app/

llms llms-benchmarking streamlit

amit-sarker / icl-analysis-nlp-685

User: amit-sarker

btlms cerebras huggingface in-context-learning llama2 llms-benchmarking mamba-state-space-models mistral-7b sentiment-analysis arithemtic-tasks

chemfoundationmodels / chemllmbench

What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks

Organization: chemfoundationmodels

Home Page: https://arxiv.org/abs/2305.18365

benchmark chemistry large-language-models llm nlp ai4science llms-benchmarking

chotom / guardrails-dss-ml-2024

Demo showcase highlighting the capabilities of Guardrails in LLMs.

User: chotom

guardrails llms llms-benchmarking

declare-lab / resta

Restore safety in fine-tuned language models through task arithmetic

Organization: declare-lab

alignment alignment-algorithm llm llm-safety llm-safety-benchmark llms llms-benchmarking safety

dinesh-kumar-mr / medivqa

Part of our final-year project, involving complex NLP tasks and experimentation with various datasets and LLMs

User: dinesh-kumar-mr

llms llms-benchmarking medical-application vqa vqa-dataset vqa-med-2018

dippatel1994 / large-language-models-evaluation-benchmarks-collection

This repository contains a list of benchmarks used by big orgs to evaluate their LLMs.

User: dippatel1994

benchmarks large-language-models llm llms llms-benchmarking

epfl-dlab / cc_flows

The data and implementation for the experiments in the paper "Flows: Building Blocks of Reasoning and Collaborating AI".

Organization: epfl-dlab

Home Page: https://epfl-dlab.github.io/aiflows/

agents ai aiflows competitive-coding competitive-programming competitive-programming-contests llms llms-benchmarking llms-reasoning

evilpsycho / open-llm-benchmark

Evaluate open-source language models on agent, formatted-output, instruction-following, long-text, multilingual, coding, and custom-task capabilities.

User: evilpsycho

evaluation-framework huggingface large-language-models llamacpp llm-agent llms-benchmarking openai vllm

lamalab-org / chem-bench

How good are LLMs at chemistry?

Organization: lamalab-org

Home Page: https://www.chembench.org/

benchmark chemistry llm llms llms-benchmarking machine-learning materials-science safety

logikon-ai / cot-eval

A framework for evaluating the effectiveness of chain-of-thought reasoning in language models.

Organization: logikon-ai

Home Page: https://huggingface.co/spaces/logikon/open_cot_leaderboard

chain-of-thought gen-ai leaderboard llm llms-benchmarking llms-reasoning

lwachowiak / llms-for-social-robotics

Code and data for the paper: "Are Large Language Models Aligned with People's Social Intuitions for Human–Robot Interactions?"

User: lwachowiak

alignment hri llms llms-benchmarking social-robotics value-alignment vlm

melvinebenezer / liah-lie_in_a_haystack

Needle in a haystack for LLMs

User: melvinebenezer

llms-benchmarking long-context needle-in-haystack llm llm-inference

microsoft / private-benchmarking

A platform that enables users to perform private benchmarking of machine learning models. The platform facilitates the evaluation of models based on different trust levels between the model owners and the dataset owners.

Organization: microsoft

benchmarking inference llms-benchmarking mpc private private-benchmarking secure ezpc large-language-models contamination

minnesotanlp / cobbler

Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"

Organization: minnesotanlp

Home Page: https://minnesotanlp.github.io/cobbler-project-page/

bias evaluation llm nlp bias-detection llm-as-a-judge llm-as-evaluator llm-as-judge llm-evaluation llms

nachodrt / merit-dataset

The MERIT Dataset is a fully synthetic, labeled dataset created for training and benchmarking LLMs on Visually Rich Document Understanding tasks. It is also designed to help detect biases and improve interpretability in LLMs, which is an area of ongoing work. This repository is actively maintained, and new features are continuously being added.

User: nachodrt

biases layoutlm layoutlmv2 layoutlmv3 layoutxlm llms-benchmarking synthetic-dataset synthetic-dataset-generation token-classification

parea-ai / parea-sdk-py

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

Organization: parea-ai

Home Page: https://docs.parea.ai/sdk/python

llm llm-evaluation llm-tools llmops llms-benchmarking llm-eval llm-evaluation-framework llm-evaluation-toolkit prompt-engineering generative-ai

parea-ai / parea-sdk-ts

TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

Organization: parea-ai

Home Page: https://docs.parea.ai/sdk/typescript

llm llm-evaluation llm-evaluation-framework llm-evaluation-toolkit llm-tools llms llms-benchmarking llm-eval prompt-engineering

paulescu / text-embedding-evaluation

Join 15k builders to the Real-World ML Newsletter ⬇️⬇️⬇️

User: paulescu

Home Page: https://www.realworldml.net/subscribe

embeddings llms llms-benchmarking machine-learning

princysinghal / html-code-generation-from-llm

Fine-tuning and evaluating a Falcon 7B model for generating HTML code from input prompts.

User: princysinghal

code-generation fine-tuning-llm llms llms-benchmarking machine-learning

s2e-lab / regexeval

Source code for the paper accepted at ICSE-NIER'24: Re(gEx|DoS)Eval: Evaluating Generated Regular Expressions and their Proneness to DoS Attacks.

Organization: s2e-lab

Home Page: https://s2e-lab.github.io/paper/research/dataset/icse-nier-2024/

benchmark-framework code-generation llms-benchmarking redos-checker redos-detector regex

santhoshi-ravi / minerva

Evaluating and enhancing Large Language Models (LLMs) on mathematical datasets using a Multi-Agent Debate architecture, without traditional fine-tuning or Retrieval-Augmented Generation techniques. This project explores advanced strategies to boost LLM capabilities in mathematical reasoning.

User: santhoshi-ravi

llm llms-benchmarking multi-agent-debate

saqib727 / artifical-intelligence-projects

See the top artificial intelligence projects based on real use cases. 😃 Why wait when you have everything in one place? 😎

User: saqib727

ai disease-prediction fakenews fraudsensor handwritten-digit-recognition heartdisease llms-benchmarking llmsecurity machine-learning machine-translation

saqib727 / blog-assistant

BlogCraft is a web application built with Streamlit that leverages AI to assist in crafting blog posts effortlessly.

User: saqib727

blogging llm llms-benchmarking

saqib727 / medical-analyst

Vital Image Analytics is an AI-powered application designed to assist healthcare professionals in analyzing medical images for diagnostic purposes.

User: saqib727

gemini gemini-api llms llms-benchmarking medical-application medical-device-false medical-image-analysis medical-image-processing medical-image-segmentation medical-images