Joining the Seed foundation model team. Focus on data pipeline, training efficiency, and scaling experiments. Strong Python + distributed training background needed.
When d_k is large, the dot products Q·Kᵀ grow in magnitude: their variance scales with d_k, pushing softmax into regions with very small gradients (saturation). Dividing by √d_k normalizes the variance back to ~1, keeping softmax in a stable gradient regime.
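A quick way to see this numerically (a minimal NumPy sketch, not part of the original answer): sample unit-variance queries and keys and compare the spread of raw vs. scaled dot products.

```python
# Sketch: empirically check that dot-product variance grows with d_k
# and that dividing by sqrt(d_k) brings the std back to ~1.
import numpy as np

rng = np.random.default_rng(0)

for d_k in (16, 64, 256):
    # Zero-mean, unit-variance query/key components, as assumed in the argument.
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    dots = (q * k).sum(axis=-1)           # raw dot products q . k
    scaled = dots / np.sqrt(d_k)          # scaled by 1/sqrt(d_k)
    print(f"d_k={d_k:4d}  raw std={dots.std():6.2f}  scaled std={scaled.std():5.2f}")
# Expected: raw std ~ sqrt(d_k); scaled std ~ 1.
```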
Scaled dot-product attention
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
Key insight (concrete example): with d_k = 64, unscaled dot products have std ≈ 8; after scaling by 1/√64 = 1/8, std ≈ 1.
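As a concrete reference, here is a minimal NumPy sketch of the formula above; the function name, shapes, and example sizes are illustrative assumptions, not from the source.

```python
# Sketch of scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) scaled logits
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

# Example: 4 queries attending over 6 keys/values with d_k = 64.
rng = np.random.default_rng(0)
Q, K = rng.standard_normal((4, 64)), rng.standard_normal((6, 64))
V = rng.standard_normal((6, 32))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 32)
```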
02 Training & Optimization
03 Architecture Design
04 Training & Alignment
05 Inference & Deployment
20 questions across 5 categories. More coming soon.