Joining the Seed foundation model team. Focus on data pipeline, training efficiency, and scaling experiments. Strong Python + distributed training background needed.
When d_k is large, the dot products Q·Kᵀ grow in magnitude: their variance scales with d_k, pushing softmax into regions with very small gradients (saturation). Dividing by √d_k normalizes the variance back to ~1, keeping softmax in a stable gradient regime.
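A quick way to see this numerically (a minimal NumPy sketch, not part of the original answer): sample unit-variance queries and keys and compare the spread of raw vs. scaled dot products.

```python
# Sketch: empirically check that dot-product variance grows with d_k
# and that dividing by sqrt(d_k) brings the std back to ~1.
import numpy as np

rng = np.random.default_rng(0)

for d_k in (16, 64, 256):
    # Zero-mean, unit-variance query/key components, as assumed in the argument.
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    dots = (q * k).sum(axis=-1)           # raw dot products q . k
    scaled = dots / np.sqrt(d_k)          # scaled by 1/sqrt(d_k)
    print(f"d_k={d_k:4d}  raw std={dots.std():6.2f}  scaled std={scaled.std():5.2f}")
# Expected: raw std ~ sqrt(d_k); scaled std ~ 1.
```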
Scaled dot-product attention
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
Key insight (concrete example): with d_k = 64, unscaled dot products have std ≈ 8; after scaling by 1/√64 = 1/8, std ≈ 1.
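As a concrete reference, here is a minimal NumPy sketch of the formula above; the function name, shapes, and example sizes are illustrative assumptions, not from the source.

```python
# Sketch of scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) scaled logits
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values

# Example: 4 queries attending over 6 keys/values with d_k = 64.
rng = np.random.default_rng(0)
Q, K = rng.standard_normal((4, 64)), rng.standard_normal((6, 64))
V = rng.standard_normal((6, 32))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 32)
```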
02 Training & Optimization
03 Architecture Design
04 Training & Alignment
05 Inference & Deployment
20 questions across 5 categories. More coming soon.