InternLM-XComposer2.5-OmniLive (IXC2.5-OL)

InternLM-XComposer2.5-OmniLive 🤗  | IXC2.5-OL Technical Report 📄

English | 简体中文

👋 join us on Discord and WeChat

Demo Video

🔥 For the best experience, please keep the audio on while enjoying the video.

demo.mp4

Requirements

  • python 3.8 and above
  • pytorch 1.12 and above, 2.0 and above are recommended
  • CUDA 11.4 and above are recommended (this is for GPU users)
  • flash-attention2 is required for high-resolution usage of InternLM-XComposer2.5.
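
A quick sanity check of the environment (a minimal sketch; not part of the official setup):

import torch

print('torch version:', torch.__version__)         # expect >= 1.12, 2.0+ recommended
print('cuda version :', torch.version.cuda)        # expect >= 11.4 for GPU usage
print('gpu available:', torch.cuda.is_available())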

Installation

Before running the code, make sure your environment meets the requirements above, then install the dependent libraries. Please refer to the installation instructions.

Docker Image

We have also created a Docker image to simplify the setup process: ixc-ol Docker Image. You can pull the image via

docker pull yhcao6/ixc2.5-ol:latest
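
If the NVIDIA Container Toolkit is installed, a typical (illustrative) way to start the container with GPU access is

docker run --gpus all -it yhcao6/ixc2.5-ol:latest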

Quickstart

We provide simple examples below to show how to use InternLM-XComposer-2.5-OL with 🤗 Transformers. For the complete guide, please refer here.

Audio Understanding
import os
os.environ['USE_HF'] = 'True'

import torch
from swift.llm import (
    get_model_tokenizer, get_template, ModelType,
    get_default_template_type, inference
)
from swift.utils import seed_everything

model_type = ModelType.qwen2_audio_7b_instruct
model_id_or_path = 'internlm/internlm-xcomposer2d5-ol-7b'
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')

# Load the audio model from the 'audio' subfolder of the IXC2.5-OL release
model, tokenizer = get_model_tokenizer(model_type, torch.float16, model_id_or_path=model_id_or_path, model_dir='audio',
                                       model_kwargs={'device_map': 'cuda:0'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)

# Chinese ASR
query = '<audio>Detect the language and recognize the speech.'
response, _ = inference(model, template, query, audios='examples/audios/chinese.mp3')
print(f'query: {query}')
print(f'response: {response}')
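
The same call also covers other languages; a continuation assuming an English clip is available locally (the path below is a placeholder):

# English ASR (the audio path is a placeholder; point it at any local English clip)
query = '<audio>Detect the language and recognize the speech.'
response, _ = inference(model, template, query, audios='examples/audios/english.mp3')
print(f'query: {query}')
print(f'response: {response}')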
Image Understanding
import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# init model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-ol-7b', model_dir='base', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2d5-ol-7b', model_dir='base', trust_remote_code=True)
model.tokenizer = tokenizer

query = 'Analyze the given image in a detailed manner'
image = ['examples/images/dubai.png']
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, _ = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, use_meta=True)
print(response)
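
Here do_sample=False with num_beams=3 gives deterministic beam-search decoding. If latency matters more than answer detail, a greedy variant of the same call (illustrative) is:

# Greedy decoding: num_beams=1 is faster, possibly at a small cost in quality
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, _ = model.chat(tokenizer, query, image, do_sample=False, num_beams=1, use_meta=True)
print(response)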
Video Understanding

Please refer to infer_llm_with_memory.py.

Interactive Demo Deploy

Select one of the two deployment options for the demo. The second option does not require an SRS server but lacks support for real-time interruption.

SRS Server + Frontend based on JavaScript + Backend with FastAPI

Please refer to Demo Setup Guide for guidelines.

Frontend based on Gradio + Backend with FastAPI

Please refer to Gradio Demo Setup Guide for guidelines.

Evaluation

We evaluate InternLM-XComposer-2.5-OL on multimodal benchmarks, including audio, video and streaming benchmarks. For complete comparisons, please refer to our technical report.

ASR benchmarks WenetSpeech and LibriSpeech (word error rate in %; lower is better).

| Method | LLM | WenetSpeech Test_Net | WenetSpeech Test_Meeting | LibriSpeech Dev_Clean | LibriSpeech Dev_Other | LibriSpeech Test_Clean | LibriSpeech Test_Other |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2-Audio | Qwen2-7B | 7.8 | 8.4 | 1.3 | 3.4 | 1.6 | 3.6 |
| Mini-Omni | Qwen2-0.5B | - | - | 4.5 | 9.7 | 4.6 | 9.2 |
| VITA | Mixtral-8x7B | 12.2 | 16.5 | 7.6 | 16.6 | 8.1 | 18.4 |
| IXC2.5-OL | Qwen2-1.5B | 9.0 | 9.2 | 2.5 | 5.7 | 2.6 | 5.8 |

Video benchmark MLVU

Inference Code
Download the videos from MLVU and save them in the directory (e.g., './video/mlvu')

└── video/                
   └── mlvu/           
      ├── 1_plotQA/ 
      │    ├──1.mp4
      │    ...
      ├── 2_needle/ 
      ├── 3_ego/ 
      ├── 4_count/ 
      ├── 5_order/
      ├── 6_anomaly_reco/  
      └── 7_topic_reasoning/  
sh benchmarks/mlvu/mlvu.sh ./video/mlvu
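
Before launching the script, it can help to confirm that the layout above is in place; a minimal sketch (folder names taken from the tree shown):

import os

# Expected MLVU task subfolders, per the directory tree above
expected = ['1_plotQA', '2_needle', '3_ego', '4_count',
            '5_order', '6_anomaly_reco', '7_topic_reasoning']
root = './video/mlvu'
missing = [d for d in expected if not os.path.isdir(os.path.join(root, d))]
print('missing subfolders:', missing if missing else 'none')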

Results

| Method | Params | Topic Rea. | Anomaly Recog. | Needle QA | Ego Rea. | Plot QA | Action Or. | Action Co. | M-Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Closed-source APIs | | | | | | | | | |
| Claude-3-Opus | - | 67.2 | 43.5 | 21.6 | 40.2 | 47.8 | 18.2 | 16.7 | 36.5 |
| Qwen-VL-Max | - | 67.4 | 63.5 | 40.3 | 40.9 | 43.3 | 25.0 | 14.8 | 42.2 |
| GPT-4 Turbo | - | 79.5 | 68.0 | 45.9 | 47.4 | 60.6 | 26.5 | 16.1 | 49.2 |
| GPT-4o | - | 87.4 | 74.5 | 64.8 | 57.1 | 65.1 | 56.7 | 46.3 | 64.6 |
| Open-source models | | | | | | | | | |
| MovieChat | 7B | 29.5 | 25.0 | 24.2 | 24.7 | 25.8 | 28.6 | 22.8 | 25.8 |
| LLaMA-VID | 7B | 50.8 | 34.5 | 30.1 | 32.7 | 32.5 | 23.9 | 27.8 | 33.2 |
| LLaVA-1.6 | 7B | 60.6 | 41.0 | 43.1 | 38.4 | 41.0 | 25.5 | 25.7 | 39.3 |
| ShareGPT4Video | 7B | 75.8 | 51.5 | 47.6 | 43.2 | 48.4 | 34.0 | 23.3 | 46.4 |
| VideoLlaMA2 | 7B | 74.6 | 64.5 | 49.9 | 43.8 | 45.1 | 34.0 | 27.4 | 48.5 |
| LongVA | 7B | 83.3 | 58.5 | 69.3 | 50.0 | 67.2 | 38.6 | 27.2 | 56.3 |
| IXC2.5 | 7B | - | - | - | - | - | - | - | 58.8 |
| InternVL2 | 8B | - | - | - | - | - | - | - | 64.0 |
| LLaVA-OneVision | 7B | - | - | - | - | - | - | - | 64.7 |
| Video-XL | 7B | - | - | - | - | - | - | - | 64.9 |
| IXC2.5-OL | 7B | 84.1 | 68.5 | 76.6 | 60.8 | 75.1 | 57.1 | 41.3 | 66.2 |

Video benchmark Video-MME

Inference Code
Download the videos from VideoMME and save them in the directory (e.g., './video/video_mme')

└── video/                
   └── video_mme/           
      ├── 026dzf-vc5g.mp4
      ├── 068rdc75mHM.mp4
      ├── 08km9Yqbt-A.mp4
      ├── 0ag_Qi5OEd0.mp4
          ...      
sh benchmarks/video_mme/video_mme.sh ./video/video_mme
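
Since Video-MME keeps all clips flat in one folder, a quick count (illustrative) confirms the download before running the script:

import glob

# Count the downloaded Video-MME clips
clips = glob.glob('./video/video_mme/*.mp4')
print(f'found {len(clips)} Video-MME clips under ./video/video_mme')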

Results

| Method | Params | Short Video | Medium Video | Long Video | Overall |
| --- | --- | --- | --- | --- | --- |
| Closed-source APIs | | | | | |
| GPT-4V | - | 70.5 | 55.8 | 53.5 | 59.9 |
| Claude 3.5 Sonnet | - | 71.0 | 57.4 | 51.2 | 60.0 |
| GPT-4o mini | - | 72.5 | 63.1 | 58.6 | 64.8 |
| GPT-4o | - | 80.0 | 70.3 | 65.3 | 71.9 |
| Gemini 1.5 Pro | - | 81.7 | 74.3 | 67.4 | 75.0 |
| Open-source models | | | | | |
| ShareGPT4Video | 7B | 48.3 | 36.3 | 35.0 | 39.9 |
| VideoLlaMA2 | 7B | - | - | - | 47.9 |
| LongVA | 7B | 61.1 | 50.4 | 46.2 | 52.6 |
| Video-XL | 7B | 64.0 | 53.2 | 49.2 | 55.5 |
| VITA | 8×7B | 65.9 | 52.9 | 48.6 | 55.8 |
| IXC2.5 | 7B | - | - | - | 55.8 |
| InternVL2 | 8B | - | - | - | 56.3 |
| LLaVA-OneVision | 7B | - | - | - | 58.2 |
| mPLUG-Owl3 | 7B | 70.0 | 57.7 | 50.1 | 59.3 |
| MiniCPM-V 2.6 | 8B | - | - | - | 60.9 |
| IXC2.5-OL | 7B | 72.7 | 58.2 | 50.8 | 60.6 |

Streaming benchmark StreamingBench

Inference Code
Download the videos from StreamingBench and save them in the directory (e.g., './video/StreamingBench')

└── video/                
   └── StreamingBench/           
      └── real/ 
          ├──sample_1/
          │    └── video.mp4
          ├──sample_10/
          │    └── video.mp4
          ├──sample_12/
          ...    
sh benchmarks/streamingbench/eval.sh ./video/StreamingBench

Results

| Method | Params | OP | CR | CS | ATP | EU | TR | PR | SU | ACP | CT | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Human | - | 89.47 | 92.00 | 93.60 | 91.47 | 95.65 | 92.52 | 88.00 | 88.75 | 89.74 | 91.30 | 91.46 |
| Closed-source APIs | | | | | | | | | | | | |
| Claude 3.5 Sonnet | - | 80.49 | 77.34 | 82.02 | 81.73 | 72.33 | 75.39 | 61.11 | 61.79 | 69.32 | 43.09 | 72.44 |
| GPT-4o | - | 77.11 | 80.47 | 83.91 | 76.47 | 70.19 | 83.80 | 66.67 | 62.19 | 69.12 | 49.22 | 73.28 |
| Gemini 1.5 Pro | - | 79.02 | 80.47 | 83.54 | 79.67 | 80.00 | 84.74 | 77.78 | 64.23 | 71.95 | 48.70 | 75.69 |
| Open-source models | | | | | | | | | | | | |
| VideoLLM-online | 8B | 39.07 | 40.06 | 34.49 | 31.05 | 45.96 | 32.40 | 31.48 | 34.16 | 42.49 | 27.89 | 35.99 |
| VideoLLaMA2 | 7B | 55.86 | 55.47 | 57.41 | 58.17 | 52.80 | 43.61 | 39.21 | 42.68 | 45.61 | 35.23 | 49.52 |
| VILA-1.5 | 8B | 53.68 | 49.22 | 70.98 | 56.86 | 53.42 | 53.89 | 54.63 | 48.78 | 50.14 | 17.62 | 52.32 |
| LongVA | 7B | 70.03 | 63.28 | 61.20 | 70.92 | 62.73 | 59.50 | 61.11 | 53.66 | 54.67 | 34.72 | 59.96 |
| InternVL2 | 8B | 68.12 | 60.94 | 69.40 | 77.12 | 67.70 | 62.93 | 59.26 | 53.25 | 54.96 | 56.48 | 63.72 |
| Kangaroo | 7B | 71.12 | 84.38 | 70.66 | 73.20 | 67.08 | 61.68 | 56.48 | 55.69 | 62.04 | 38.86 | 64.60 |
| MiniCPM-V 2.6 | 8B | 71.93 | 71.09 | 77.92 | 75.82 | 64.60 | 65.73 | 70.37 | 56.10 | 62.32 | 53.37 | 67.44 |
| Qwen2-VL | 7B | 75.20 | 82.81 | 73.19 | 77.45 | 68.32 | 71.03 | 72.22 | 61.19 | 69.04 | 46.11 | 69.04 |
| LLaVA-OneVision | 7B | 80.38 | 74.22 | 76.03 | 80.72 | 72.67 | 71.65 | 67.59 | 65.45 | 65.72 | 45.08 | 71.12 |
| IXC2.5-OL | 7B | 82.83 | 73.77 | 78.66 | 82.95 | 72.50 | 76.01 | 61.11 | 60.67 | 71.59 | 58.85 | 73.79 |

Video benchmark MVBench

Inference Code
Download the videos from MVBench and save them in the directory (e.g., './video/mvbench')

└── video/                
   └── mvbench/           
      ├── clevrer/ 
      │   └── video_validation/
      │         ├── video_10009.mp4
      │         ├── video_10016.mp4
      │         ├── video_10017.mp4
      │         ...
      ├── FunQA_test/
      │   └── test/
      │         ├──test_creative/
      │         │  ├── C_KT_10_6402_6422.mp4
      │         │  ├── C_KT_12_1452_1602.mp4
      │         │  ├── C_KT_12_5112_5200.mp4
      │         │  ...
      │         ├──test_humor/
      │         │  ├── H_A_101_1433_1631.mp4
      │         │  ├── H_A_112_0436_0691.mp4
      │         │  ├── H_A_125_2078_2286.mp4
      │         │  ... 
      │         ...
      ...  
sh benchmarks/mvbench/mvbench.sh ./video/mvbench

Results

| Method | Params | AS | AP | AA | FA | UA | OE | OI | OS | MD | AL | ST | AC | MC | MA | SC | FP | CO | EN | ER | CI | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Closed-source APIs | | | | | | | | | | | | | | | | | | | | | | |
| GPT-4V | - | 55.5 | 63.5 | 72.0 | 46.5 | 73.5 | 18.5 | 59.0 | 29.5 | 12.0 | 40.5 | 83.5 | 39.0 | 12.0 | 22.5 | 45.0 | 47.5 | 52.0 | 31.0 | 59.0 | 11.0 | 43.5 |
| GPT-4o | - | 61.5 | 56.5 | 72.0 | 54.0 | 82.0 | 62.5 | 66.5 | 44.0 | 36.5 | 33.5 | 93.0 | 54.5 | 33.5 | 54.5 | 53.5 | 74.5 | 71.5 | 32.5 | 71.0 | 42.5 | 57.5 |
| Open-source models | | | | | | | | | | | | | | | | | | | | | | |
| VideoLLaMA | 7B | 27.5 | 25.5 | 51.0 | 29.0 | 39.0 | 48.0 | 40.5 | 38.0 | 22.5 | 22.5 | 43.0 | 34.0 | 22.5 | 32.5 | 45.5 | 32.5 | 40.0 | 30.0 | 21.0 | 37.0 | 34.1 |
| VideoChat | 7B | 33.5 | 26.5 | 56.0 | 33.5 | 40.5 | 53.0 | 40.5 | 30.0 | 25.5 | 27.0 | 48.5 | 35.0 | 20.5 | 42.5 | 46.0 | 26.5 | 41.0 | 23.5 | 23.5 | 36.0 | 35.5 |
| MiniCPM-V 2.6 | 7B | 38.0 | 43.0 | 63.0 | 35.5 | 67.5 | 55.5 | 46.0 | 35.5 | 25.5 | 33.0 | 77.5 | 48.0 | 37.0 | 54.0 | 42.5 | 40.0 | 31.0 | 38.0 | 43.0 | 40.5 | 44.7 |
| VideoChat2 | 7B | 66.0 | 47.5 | 83.5 | 49.5 | 60.0 | 58.0 | 71.5 | 42.5 | 23.0 | 23.0 | 88.5 | 39.0 | 42.0 | 58.5 | 44.0 | 49.0 | 36.5 | 35.0 | 40.5 | 65.5 | 51.1 |
| Qwen2-VL | 7B | 51.0 | 58.0 | 77.5 | 47.0 | 64.0 | 63.0 | 65.5 | 40.0 | 25.5 | 35.5 | 77.0 | 43.5 | 47.0 | 62.0 | 42.0 | 61.5 | 49.5 | 41.5 | 47.5 | 41.5 | 52.0 |
| PLLaVA | 34B | 65.0 | 53.0 | 83.5 | 45.0 | 77.5 | 70.0 | 64.5 | 38.5 | 37.5 | 49.0 | 89.5 | 41.5 | 43.5 | 70.0 | 53.0 | 52.5 | 65.0 | 39.5 | 60.5 | 58.0 | 57.8 |
| LLaVA-OneVision | 72B | 63.0 | 58.0 | 84.5 | 46.5 | 85.5 | 64.0 | 73.5 | 41.5 | 37.0 | 69.0 | 95.0 | 47.5 | 47.5 | 75.5 | 53.5 | 52.0 | 70.5 | 34.0 | 64.0 | 54.5 | 60.8 |
| InternVL2 | 8B | 75.0 | 62.0 | 83.5 | 40.5 | 69.5 | 96.0 | 72.0 | 29.5 | 58.0 | 53.0 | 88.5 | 39.5 | 83.0 | 97.0 | 51.0 | 78.5 | 65.0 | 33.0 | 48.0 | 67.0 | 64.5 |
| IXC2.5-OL | 7B | 84.5 | 81.0 | 75.0 | 46.0 | 81.0 | 92.0 | 79.5 | 36.5 | 83.0 | 47.0 | 90.0 | 60.5 | 75.0 | 93.0 | 58.0 | 60.5 | 74.0 | 42.0 | 53.0 | 62.0 | 68.7 |

Video benchmark MMBench-Video

Inference Code

We use VLMEvalKit to evaluate MMBench-Video. Please refer to VLMEvalKit.

# In vlmeval/config.py, change the model_path of XComposer2d5 from internlm/internlm-xcomposer2d5-7b to internlm-xcomposer2d5-ol-7b/base
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model XComposer2d5 --nframe 64

Results

| Method | Params | CP | FP-S | FP-C | HL | LR | AR | RR | CSR | TP | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Closed-source APIs | | | | | | | | | | | |
| Claude 3.5 Sonnet | - | 1.57 | 1.39 | 1.07 | 1.40 | 1.13 | 1.70 | 1.48 | 1.54 | 1.04 | 1.38 |
| Gemini 1.0 Pro | - | 1.61 | 1.56 | 1.30 | 0.65 | 1.15 | 1.57 | 1.55 | 1.36 | 1.33 | 1.48 |
| Gemini 1.5 Pro | - | 1.99 | 2.04 | 1.70 | 1.90 | 1.98 | 2.02 | 1.92 | 1.78 | 1.63 | 1.94 |
| GPT-4V | - | 1.83 | 1.65 | 1.40 | 1.76 | 1.66 | 1.91 | 1.86 | 1.83 | 1.53 | 1.68 |
| GPT-4o | - | 2.23 | 2.24 | 2.01 | 1.90 | 2.19 | 2.12 | 2.17 | 1.94 | 1.97 | 2.15 |
| Open-source models | | | | | | | | | | | |
| MovieLLM | 7B | 0.95 | 0.82 | 0.70 | 0.15 | 0.52 | 1.12 | 1.22 | 0.54 | 1.05 | 0.87 |
| LLaVA-OneVision | 72B | 1.22 | 1.07 | 0.90 | 0.21 | 0.76 | 0.96 | 0.55 | 0.81 | 0.48 | 0.94 |
| PLLaVA | 7B | 1.08 | 1.06 | 0.86 | 0.52 | 0.64 | 1.25 | 1.17 | 0.98 | 1.01 | 1.03 |
| ShareGPT4Video | 7B | 1.20 | 1.05 | 1.00 | 0.32 | 0.89 | 1.06 | 1.19 | 1.01 | 0.99 | 1.05 |
| VideoStreaming | 7B | 1.38 | 1.13 | 0.80 | 0.32 | 0.77 | 1.27 | 1.11 | 1.01 | 1.10 | 1.12 |
| LLaVA-NeXT-Video | 7B | 1.35 | 1.15 | 0.97 | 0.58 | 0.64 | 1.38 | 1.30 | 1.27 | 1.03 | 1.14 |
| VILA1.5 | 13B | 1.51 | 1.45 | 1.26 | 0.24 | 0.80 | 1.52 | 1.30 | 1.40 | 1.28 | 1.36 |
| InternVL2 | 8B | 1.41 | 1.37 | 1.15 | 0.19 | 0.90 | 1.34 | 1.38 | 1.14 | 1.00 | 1.26 |
| Qwen2-VL | 7B | 1.63 | 1.51 | 1.19 | 0.55 | 1.16 | 1.56 | 1.49 | 1.37 | 1.21 | 1.44 |
| IXC2.5-OL | 7B | 1.53 | 1.61 | 1.20 | 0.15 | 0.93 | 1.44 | 1.57 | 1.30 | 1.08 | 1.42 |

Citation

If you find our models / code / papers useful in your research, please consider giving ⭐ and citations 📝, thx :)

@article{internlmxcomposer2_5_OL,
      title={InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions}, 
      author={Pan Zhang and Xiaoyi Dong and Yuhang Cao and Yuhang Zang and Rui Qian and Xilin Wei and Lin Chen and Yifei Li and Junbo Niu and Shuangrui Ding and Qipeng Guo and Haodong Duan and Xin Chen and Han Lv and Zheng Nie and Min Zhang and Bin Wang and Wenwei Zhang and Xinyue Zhang and Jiaye Ge and Wei Li and Jingwen Li and Zhongying Tu and Conghui He and Xingcheng Zhang and Kai Chen and Yu Qiao and Dahua Lin and Jiaqi Wang},
      journal={arXiv preprint arXiv:2412.09596},
      year={2024}
}
@article{internlmxcomposer2_5,
      title={InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output}, 
      author={Pan Zhang and Xiaoyi Dong and Yuhang Zang and Yuhang Cao and Rui Qian and Lin Chen and Qipeng Guo and Haodong Duan and Bin Wang and Linke Ouyang and Songyang Zhang and Wenwei Zhang and Yining Li and Yang Gao and Peng Sun and Xinyue Zhang and Wei Li and Jingwen Li and Wenhai Wang and Hang Yan and Conghui He and Xingcheng Zhang and Kai Chen and Jifeng Dai and Yu Qiao and Dahua Lin and Jiaqi Wang},
      journal={arXiv preprint arXiv:2407.03320},
      year={2024}
}
@article{internlmxcomposer2_4khd,
      title={InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD},
      author={Xiaoyi Dong and Pan Zhang and Yuhang Zang and Yuhang Cao and Bin Wang and Linke Ouyang and Songyang Zhang and Haodong Duan and Wenwei Zhang and Yining Li and Hang Yan and Yang Gao and Zhe Chen and Xinyue Zhang and Wei Li and Jingwen Li and Wenhai Wang and Kai Chen and Conghui He and Xingcheng Zhang and Jifeng Dai and Yu Qiao and Dahua Lin and Jiaqi Wang},
      journal={arXiv preprint arXiv:2404.06512},
      year={2024}
}
@article{internlmxcomposer2,
      title={InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model},
      author={Xiaoyi Dong and Pan Zhang and Yuhang Zang and Yuhang Cao and Bin Wang and Linke Ouyang and Xilin Wei and Songyang Zhang and Haodong Duan and Maosong Cao and Wenwei Zhang and Yining Li and Hang Yan and Yang Gao and Xinyue Zhang and Wei Li and Jingwen Li and Kai Chen and Conghui He and Xingcheng Zhang and Yu Qiao and Dahua Lin and Jiaqi Wang},
      journal={arXiv preprint arXiv:2401.16420},
      year={2024}
}
@article{internlmxcomposer,
      title={InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition},
      author={Pan Zhang and Xiaoyi Dong and Bin Wang and Yuhang Cao and Chao Xu and Linke Ouyang and Zhiyuan Zhao and Shuangrui Ding and Songyang Zhang and Haodong Duan and Wenwei Zhang and Hang Yan and Xinyue Zhang and Wei Li and Jingwen Li and Kai Chen and Conghui He and Xingcheng Zhang and Yu Qiao and Dahua Lin and Jiaqi Wang},
      journal={arXiv preprint arXiv:2309.15112},
      year={2023}
}

License & Contact Us

The code is licensed under Apache-2.0, while model weights are fully open for academic research and also allow free commercial usage. To apply for a commercial license, please fill in the application form (English) / application form (Chinese). For other questions or collaborations, please contact [email protected].