Now @ Tencent Hunyuan · Inference Framework Intern

Hong Zhuang.

AI Infra Engineer · LLM Training & Inference Acceleration

MSc Informatics @ Technical University of Munich · previously led long-sequence parallel training optimization at Huawei Ascend and built a real-time small-object detection system at Moii.AI.

Let’s connect → See my work

75% GPU saved (Wan2.1)
50% memory cut
+30% CV accuracy
+200% inference speed

Munich, Germany

PyTorch · vLLM · TRL · VERL · DeepSpeed-Ulysses

/01

About

A quick snapshot of where I am, what I work on, and what I’m looking for next.

Hi, I’m Hong Zhuang, an MSc Informatics student at the Technical University of Munich (TUM). I work on training, scaling, and accelerating large multimodal and reinforcement-learning models.

I am currently an inference-framework intern in Tencent’s Hunyuan Foundation Group (AI Infra), working on a unified cross-chip operator-precision verification framework and on vLLM inference acceleration. Previously, at Huawei’s Ascend Model Platform, I led long-sequence parallel optimization for the Wan2.1 14B text-to-video model (a joint innovation project with JD), cutting GPU usage by 75% and memory by 50%. Earlier, at Moii.AI, I helped build a real-time firearm detection system that won the Amazon Sigma Award.

Interests: model engineering, algorithm deployment, model acceleration, and inference frameworks.

/02

Skills

A bento-style snapshot of my tech stack, grouped by day-to-day use.

A1 / Core

AI Frameworks & Training

My daily drivers for everything from quick research prototypes to full-parameter multi-NPU RL fine-tuning.

PyTorch
HuggingFace
LLaMA-Factory
DiffSynth-Studio
Pandas

A2 / Acceleration

Distributed & Inference

Long-sequence parallelism, PagedAttention, and the rest of the modern LLM serving stack.

DeepSpeed-Ulysses
vLLM
Sequence Parallel

A3 / RL

RL Fine-tuning

GRPO-style RL fine-tuning ported from GPU to Ascend NPU, covering training, inference, tuning, and deployment end to end.

TRL
VERL
GRPO

B1 / Lang

Languages

Python
C++
Java
Bash

B2 / Vision

Computer Vision

YOLOv8
SAHI
OpenCV
I3D

B3 / Infra

Infra & Tooling

Ascend NPU
GCP
FastAPI
Git
Ubuntu
SQLite

/03

Experience

From inference frameworks and cross-chip precision validation, to large-scale training, to shipping detection systems and research.

Inference-Framework Intern · AI Infra
Tencent · Hunyuan Foundation Group · Apr 2026 – Present
Designing and building a unified cross-chip (Ascend / AMD / NVIDIA) operator-precision verification framework for large models, and developing and adapting acceleration features for the vLLM inference stack.
- Using Hunyuan-3 as the base model, integrated the precision-verification framework into the vLLM adaptation and acceleration CI/CD pipelines for each card type.
- The framework acts as a first line of defense, validating operator precision across card types to reduce production issues and precision regressions.
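To make the idea of an operator-precision check concrete, here is a minimal sketch of the kind of comparison such a check might run. It is my own illustration, not the Tencent framework: the names `PrecisionReport` and `compare_outputs`, the tolerance values, and the sample numbers are all hypothetical. It compares one operator's output from a candidate backend against a reference backend using the common rule `|candidate - reference| <= atol + rtol * |reference|`.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PrecisionReport:
    max_abs_err: float  # largest absolute difference seen
    max_rel_err: float  # largest relative difference (non-zero references only)
    passed: bool        # True if every element is within tolerance


def compare_outputs(
    reference: Sequence[float],
    candidate: Sequence[float],
    rtol: float = 1e-3,   # hypothetical relative tolerance
    atol: float = 1e-5,   # hypothetical absolute tolerance
) -> PrecisionReport:
    """Element-wise check: |candidate - reference| <= atol + rtol * |reference|."""
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate must have the same length")

    max_abs = 0.0
    max_rel = 0.0
    passed = True
    for ref, cand in zip(reference, candidate):
        abs_err = abs(cand - ref)
        max_abs = max(max_abs, abs_err)
        if ref != 0.0:
            max_rel = max(max_rel, abs_err / abs(ref))
        if abs_err > atol + rtol * abs(ref):
            passed = False
    return PrecisionReport(max_abs, max_rel, passed)


# Made-up outputs of the same operator on a reference chip and a candidate chip.
reference_out = [1.0, 2.0, -3.0]
candidate_out = [1.0005, 2.0, -3.0]
print(compare_outputs(reference_out, candidate_out))
```

In a CI/CD setting, a failed report for any operator would block the change for that card type before it reaches production.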
Software Engineer (OD) · Ascend Model Platform
Huawei Technologies · Aug 2024 – Sep 2025
Focused on training and inference acceleration for large multimodal models: cross-chip adaptation (GPU → Ascend NPU), sequence parallelism, and full-parameter RL training on multi-NPU clusters.
- Designed and implemented a long-sequence parallel feature that cut the device count needed for training by 75% and memory by 50%, enabling longer sequences on the same hardware.
- Led the GPU → Ascend NPU migration of GRPO RL stacks built on TRL and verl, owning technical troubleshooting and solution design across training, inference, tuning, and deployment.
- Built and maintained efficient dev / test / deployment environments spanning NPU and GPU.
Machine Learning Engineer Intern
Moii.AI · Aug 2023 – Dec 2023

Designed and built ML inference APIs for real-time threat detection on images, videos, and live video streams. Trained YOLOv8 and optimized SAHI’s adaptive slice-size algorithm for small-object detection: accuracy +30%, inference speed +200%.

Held regular code reviews with the development team to settle system architecture, feature implementation, performance, and cost.
Research Assistant (RA)
Michigan State University · May 2023 – Aug 2023
Researched applications of multimodal models to automatic grading of medical students’ exams, violence detection, and related scenarios.
- Led an auto-scoring project for medical students’ practical exams: trained a SOTA weakly-supervised temporal action localization (WTAL) multimodal model and added text information to improve performance.
- Built a 1.2 TB dataset of nearly 9,000 videos for training; the model reached 92.75% scoring accuracy and 94.88% ROC-AUC.

/04

Education

Two campuses, two countries, one through-line.

Technical University of Munich (TUM)

M.Sc. Informatics

2025 – 2027 · Munich, Germany

QS World University Rankings: top 25 globally. Coursework and research focus on machine learning, distributed systems, and computer vision.

Michigan State University

B.Sc. Computer Science · Minor: Business

2019 – 2023 · East Lansing, MI, USA

GPA 3.892 / 4.00 (Top 10%) · Major GPA 3.924. Dean’s List × 6.

/05

Publications

First-author work published, with more in the pipeline.

2024 · First Author

Enhanced DeblurGAN: An advanced combinatorial model for motion blur removal in low-light photography

Applied and Computational Engineering, 20–25

Read paper · DOI 10.54254/2755-2721/51/20241152

/06

Selected work

Three projects spanning model acceleration, applied CV, and multimodal research.

Distributed · Multimodal · Text-to-Video

Wan2.1 14B Long-Sequence Parallel Training

Joint innovation project with JD. Building on ModelScope DiffSynth-Studio, adapted the Wan2.1 14B text-to-video model to Ascend clusters, profiled its memory usage, and implemented a DeepSpeed-Ulysses sequence-parallel strategy. 32K-sequence training now runs on 8 NPUs instead of 32, cutting nodes by 75% and memory by 50%.

DeepSpeed-Ulysses
Ascend NPU
DiffSynth-Studio
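
As a rough illustration of the sequence-parallel idea behind the Wan2.1 project above, the sketch below does only the shape bookkeeping for DeepSpeed-Ulysses-style sharding: outside attention each device holds a slice of the sequence with all heads, and an all-to-all exchange around attention gives each device the full sequence for a slice of the heads. It is not the project code. The helper names are hypothetical, and the head count, head dimension, parallel degree, and reading "32K" as 32,768 tokens are assumptions for the example.

```python
from typing import Tuple

Shape = Tuple[int, int, int]  # (tokens, heads, head_dim) held by one device


def ulysses_shapes(
    seq_len: int, num_heads: int, head_dim: int, sp_degree: int
) -> Tuple[Shape, Shape]:
    """Per-device activation shapes outside and inside the attention block."""
    if seq_len % sp_degree or num_heads % sp_degree:
        raise ValueError("seq_len and num_heads must be divisible by sp_degree")
    # Outside attention: a slice of the sequence, all heads.
    sequence_sharded = (seq_len // sp_degree, num_heads, head_dim)
    # After the all-to-all: the full sequence, a slice of the heads.
    head_sharded = (seq_len, num_heads // sp_degree, head_dim)
    return sequence_sharded, head_sharded


def reduction_pct(before: int, after: int) -> float:
    """Percentage reduction when going from `before` devices to `after`."""
    return 100.0 * (before - after) / before


# Illustrative values only (assumed, not taken from the project).
outside, inside = ulysses_shapes(
    seq_len=32_768, num_heads=40, head_dim=128, sp_degree=8
)
print("outside attention:", outside)
print("inside attention: ", inside)
print("device reduction, 32 -> 8:", reduction_pct(32, 8), "%")
```

Both layouts hold the same number of elements per device; the all-to-all only changes which dimension is split, which is why per-device activation memory shrinks with the parallel degree.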

Real-time firearm detection screenshot

Computer Vision · FastAPI · GCP

Real-time Firearm Detection and Alerting System

A web system that lets users connect live camera streams while the backend detects small objects such as firearms in real time. YOLOv8 + SAHI: accuracy +30%, inference speed +200%. Sends an email alert as soon as a threat is detected. Won the Amazon Sigma Award (1st of 30 teams).

YOLOv8
SAHI
FastAPI
Google Cloud
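
The slicing idea behind the YOLOv8 + SAHI setup above can be sketched as follows: cut the frame into overlapping tiles, run the detector on each tile, then merge the detections. This is a simplified, fixed-tile illustration written from scratch. It is neither SAHI's implementation nor the adaptive slice-size logic used in the project, and the function name `slice_boxes` and the example sizes are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


def slice_boxes(
    img_w: int, img_h: int, tile: int, overlap_ratio: float = 0.2
) -> List[Box]:
    """Overlapping square tiles that together cover the whole image."""
    if tile <= 0 or not 0.0 <= overlap_ratio < 1.0:
        raise ValueError("tile must be positive and overlap_ratio in [0, 1)")
    step = max(1, int(tile * (1.0 - overlap_ratio)))

    boxes: List[Box] = []
    y = 0
    while True:
        # Clamp so the last row ends exactly at the bottom edge.
        y0 = min(y, max(img_h - tile, 0))
        x = 0
        while True:
            # Clamp so the last column ends exactly at the right edge.
            x0 = min(x, max(img_w - tile, 0))
            boxes.append((x0, y0, min(x0 + tile, img_w), min(y0 + tile, img_h)))
            if x0 + tile >= img_w:
                break
            x += step
        if y0 + tile >= img_h:
            break
        y += step
    return boxes


# Example frame and tile size (made up for illustration).
for box in slice_boxes(img_w=1000, img_h=600, tile=512):
    print(box)
```

Small objects occupy more pixels relative to a tile than to the full frame, which is why tiling tends to help a detector such as YOLOv8 on small targets, at the cost of more forward passes per frame.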

Weakly-Supervised TAL · Multimodal · Research

Multimodal WTAL for Medical Exams

Undergraduate research project. Built on a SOTA weakly-supervised temporal action localization (WTAL) multimodal model, with text features added to improve recognition, to automatically score medical students’ practical exams. Extracted optical-flow and RGB features from ~9,000 videos (> 1.2 TB) using I3D and trained on 4× RTX A6000, reaching 92.75% accuracy and 94.88% ROC-AUC.

I3D
WTAL
4× A6000

/07

Get in touch

Best for internship opportunities, research collaborations, or just to say hi.