HZ.
Let’s talk
Now @ Tencent Hunyuan · Inference Framework Internship

Hong Zhuang (庄宏).

Machine Learning Engineer · Multimodal LLM Training & Acceleration

M.Sc. Informatics @ Technical University of Munich · previously led long-sequence parallel training optimization at Huawei’s Ascend Model Platform and built real-time small-object detection at Moii.AI.

  • 75% GPU saved (Wan2.1)
  • 50% memory cut
  • +30% CV accuracy
  • +200% inference speed
Portrait of Hong Zhuang · Munich, Germany
PyTorch · vLLM · TRL · VERL · DeepSpeed-Ulysses
/01

About

A quick snapshot of where I am, what I work on, and what I’m looking for next.

Hi, I’m Hong Zhuang, an M.Sc. Informatics student at the Technical University of Munich (TUM). I work on training, scaling, and accelerating large multimodal and reinforcement-learning models.

I’m currently an inference-framework intern in Tencent’s Hunyuan Foundation Group (AI Infra), building a unified cross-chip operator-precision verification framework and working on vLLM inference acceleration. Previously, at Huawei’s Ascend Model Platform, I led long-sequence parallel optimization for the Wan2.1 14B text-to-video model (a joint innovation project with JD), cutting GPU usage by 75% and memory by 50%. Earlier, at Moii.AI, I helped build a real-time firearm detection system that won the Amazon Sigma Award.

Interests: model engineering, algorithm deployment, model acceleration, and inference frameworks.

/02

Skills

A bento-style snapshot of my tech stack, grouped by day-to-day use.

A1 / Core

AI Frameworks & Training

Daily drivers for everything from quick research prototypes to full-parameter multi-NPU RL fine-tuning.

  • PyTorch
  • HuggingFace
  • LLaMA-Factory
  • DiffSynth-Studio
  • Pandas
A2 / Acceleration

Distributed & Inference

Long-sequence parallelism, PagedAttention, and the rest of the modern LLM serving stack.

  • DeepSpeed-Ulysses
  • vLLM
  • Sequence Parallel
A3 / RL

RL Fine-tuning

GRPO-style RL fine-tuning ported from GPU to Ascend NPU, covering training, inference, tuning, and deployment end to end.

  • TRL
  • VERL
  • GRPO
B1 / Lang

Languages

  • Python
  • C++
  • Java
  • Bash
B2 / Vision

Computer Vision

  • YOLOv8
  • SAHI
  • OpenCV
  • I3D
B3 / Infra

Infra & Tooling

  • Ascend NPU
  • GCP
  • FastAPI
  • Git
  • Ubuntu
  • SQLite
/03

Experience

From inference frameworks and cross-chip precision validation, to large-scale training, to shipping detection systems and research.

  1. Inference-Framework Intern · AI Infra

    Tencent · Hunyuan Foundation Group · Apr 2026 – Present

    Designing and building a unified cross-chip (Ascend / AMD / NVIDIA) operator-precision verification framework for large models, plus acceleration-feature development and adaptation for the vLLM inference stack.

    • Using Hunyuan-3 as the base model, integrated the precision-verification framework into the vLLM adaptation and acceleration CI/CD pipelines for each card type.
    • The framework acts as a first line of defense, validating operator precision across card types to reduce production issues and precision regressions.
  2. Software Engineer (OD) · Ascend Model Platform

    Huawei Technologies · Aug 2024 – Sep 2025

    Focused on training and inference acceleration for multimodal large models: cross-chip adaptation (GPU → Ascend NPU), sequence parallelism, and full-parameter RL training on multi-NPU clusters.

    • Designed and implemented a long-sequence parallel feature that cut the training hardware footprint by 75% and memory by 50%, enabling longer sequences on the same hardware.
    • Led the GPU → Ascend NPU port of GRPO RL stacks built on TRL and verl, owning troubleshooting and solution design across training, inference, tuning, and deployment.
    • Built and maintained efficient dev / test / deployment environments spanning NPU and GPU.
  3. Machine Learning Engineer Intern

    Moii.AI · Aug 2023 – Dec 2023

    Designed and built ML inference APIs for real-time threat detection on images, videos, and live video streams. Trained YOLOv8 and optimized SAHI’s adaptive slice-size algorithm for small-object detection, improving accuracy by 30% and inference speed by 200%.

    Held regular code reviews with the development team to settle system architecture, feature implementation, performance, and cost.

  4. Research Assistant (RA)

    Michigan State University · May 2023 – Aug 2023

    Researched applications of multimodal models to medical-exam auto-grading, violence detection, and related scenarios.

    • Led an auto-scoring project for medical students’ practical exams: trained a SOTA weakly-supervised temporal action localization (WTAL) multimodal model and added text information to improve performance.
    • Built a 1.2 TB training dataset of nearly 9,000 videos; the model reached 92.75% scoring accuracy and 94.88% ROC-AUC.
/04

Education

Two campuses, two countries, one through-line.

Technical University of Munich (TUM)

M.Sc. Informatics

2025 – 2027 · Munich, Germany

QS World University Rankings: top 25 globally. Coursework and research focus: machine learning, distributed systems, and computer vision.

Michigan State University

B.Sc. Computer Science · Minor: Business

2019 – 2023 · East Lansing, MI, USA

GPA 3.892 / 4.00 (top 10%) · Major GPA 3.924. Dean’s List × 6.

/05

Publications

First-author paper published, with more work in preparation.

/06

Selected work

Three projects spanning model acceleration, applied CV, and multimodal research.

Distributed · Multimodal · Text-to-Video

Wan2.1 14B Long-Sequence Parallel Training

Joint innovation project with JD. Built on ModelScope DiffSynth-Studio: adapted and memory-profiled the Wan2.1 14B text-to-video model on Ascend clusters, then implemented a DeepSpeed-Ulysses sequence-parallel strategy. 32K-sequence training now runs on 8 NPUs instead of 32, cutting nodes by 75% and memory by 50%.

  • DeepSpeed-Ulysses
  • Ascend NPU
  • DiffSynth-Studio
Screenshot: real-time firearm detection system
Computer Vision · FastAPI · GCP

Real-time Firearm Detection and Alerting

A web system that lets users connect live camera streams while the backend detects small objects such as firearms in real time. YOLOv8 + SAHI improved accuracy by 30% and inference speed by 200%. Detected threats trigger immediate email alerts. Won the Amazon Sigma Award (1st of 30 teams).

  • YOLOv8
  • SAHI
  • FastAPI
  • Google Cloud
Weakly-Supervised TAL · Multimodal · Research

Multimodal WTAL for Medical Exams

Undergraduate research project. Extended a SOTA weakly-supervised temporal action localization (WTAL) multimodal model with text features to automatically score medical students’ practical exams. Extracted optical-flow and RGB features from ~9,000 videos (> 1.2 TB) using I3D and trained on 4× RTX A6000, reaching 92.75% accuracy and 94.88% ROC-AUC.

  • I3D
  • WTAL
  • 4× A6000
/07

Get in touch

Best for internship opportunities, research collaborations, or just saying hi.