GPU High-Performance Operator Requirements to TileLang Code Generator

You are an expert GPU kernel developer proficient in TileLang (tile-lang), a Python-based DSL for writing high-performance GPU kernels. Help me implement a custom kernel.

My Kernel Requirements

Operation type: [e.g., GEMM / FlashAttention / Dequant GEMM / Custom fused op]
Data types: [e.g., FP16 / BF16 / FP8 / INT8 inputs; state the accumulator type, e.g., FP32 or INT32]
Matrix dimensions: [e.g., M=4096, N=4096, K=1024 / variable batch]
Target hardware: [e.g., NVIDIA A100 / H100 / Apple M-series / Huawei Ascend]
Performance target: [e.g., >90% of cuBLAS / match FlashAttention-2 throughput]

Generate

TileLang Kernel Code: Complete, runnable TileLang kernel with:
- Proper tile sizes for the target hardware
- Shared memory usage and pipeline stages
- T.gemm() or T.reduce() primitives as appropriate
- Block and thread configuration
Launch Configuration: Host-side code to compile and invoke the kernel
Correctness Test: A simple test comparing the kernel output against a PyTorch reference within a stated tolerance
Performance Benchmark: Benchmark script with roofline analysis
Optimization Notes: Explain tile size choices, memory access patterns, and potential further optimizations
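
For the tile-size and pipeline-stage items above, this is a minimal sketch of the shared-memory estimate I would like the Optimization Notes to include. It assumes the pipeline keeps one A tile and one B tile per stage in shared memory and the accumulator in registers, which is a simplification. The function names and the 96 KiB limit are hypothetical placeholders, not TileLang API or a documented device limit; substitute the per-block limit documented for the target hardware.

```python
def smem_bytes(block_m, block_n, block_k, elem_bytes, num_stages):
    """Shared memory needed for staged A (block_m x block_k) and B (block_k x block_n) tiles."""
    per_stage = (block_m * block_k + block_k * block_n) * elem_bytes
    return per_stage * num_stages


def fits_in_smem(block_m, block_n, block_k, elem_bytes, num_stages, smem_limit_bytes):
    """True if the staged tiles fit within the given per-block shared-memory limit."""
    return smem_bytes(block_m, block_n, block_k, elem_bytes, num_stages) <= smem_limit_bytes


# Hypothetical limit for illustration only; look up the real value for your GPU.
HYPOTHETICAL_SMEM_LIMIT = 96 * 1024

# Illustrative configuration: 128x128x32 tiles, 2-byte elements (FP16), 3 stages.
budget = smem_bytes(128, 128, 32, elem_bytes=2, num_stages=3)
ok = fits_in_smem(128, 128, 32, 2, 3, HYPOTHETICAL_SMEM_LIMIT)
print(f"{budget} bytes ({budget / 1024:.0f} KiB), fits: {ok}")
```

Doubling block_k or adding a stage scales this estimate linearly, so the notes should state which of the two was traded against occupancy.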

Use TileLang v0.1.6+ API conventions. Include comments explaining each optimization decision.
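
For the roofline analysis in the benchmark, here is a minimal sketch of the calculation I expect, written for a plain GEMM. It assumes each of A, B, and C crosses global memory exactly once (an ideal lower bound on traffic) and counts 2*M*N*K floating-point operations. The function names are hypothetical, and the peak throughput and bandwidth figures are illustrative placeholders, not specifications of any particular GPU.

```python
def gemm_roofline(m, n, k, elem_bytes, peak_flops, mem_bw_bytes_per_s):
    """Return (flops, bytes_moved, arithmetic_intensity, attainable_flops) for C = A @ B."""
    flops = 2 * m * n * k                                # one multiply + one add per MAC
    bytes_moved = (m * k + k * n + m * n) * elem_bytes   # read A and B once, write C once
    intensity = flops / bytes_moved                      # FLOP per byte
    attainable = min(peak_flops, intensity * mem_bw_bytes_per_s)
    return flops, bytes_moved, intensity, attainable


def roofline_efficiency(flops, seconds, attainable_flops):
    """Fraction of the roofline bound achieved by a measured run."""
    return (flops / seconds) / attainable_flops


# Illustrative hardware numbers (assumptions; replace with the target's datasheet values).
PEAK_FLOPS = 300e12   # FLOP/s
MEM_BW = 1.5e12       # bytes/s

flops, moved, ai, bound = gemm_roofline(4096, 4096, 1024, 2, PEAK_FLOPS, MEM_BW)
regime = "compute-bound" if ai * MEM_BW >= PEAK_FLOPS else "memory-bound"
print(f"AI = {ai:.1f} FLOP/byte, bound = {bound / 1e12:.0f} TFLOP/s, {regime}")
```

The benchmark script should report the measured kernel time through something like `roofline_efficiency` so the result is expressed against the bound rather than only against cuBLAS.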