All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Flash Attention 2 integration
- Paged Attention for improved memory efficiency
- Code Organization:
  - Extracted `make_factory_kwargs()` and `init_lora_weights()` utilities
  - Unified Registry implementation using `ComponentRegistry`
  - Migrated all `__main__` demo code to test files (7 new test files)
  - Added custom exception module with hierarchical exception types
- Error Handling:
  - Replaced broad `except Exception` with specific exception types
  - Improved API error logging and message handling
- Type Annotations:
  - Fixed `pad_token_id` duplicate definition
  - Fixed `normalized_shape` tuple type mismatches
  - Added missing type annotations in MoE module
  - Fixed None handling in config utilities
- Code Quality:
  - Removed ~600 lines of demo code from source modules
  - Preserved educational NumPy implementations for learning
  - Added comprehensive test coverage for demo functionality
- SFT (Supervised Fine-tuning):
  - `SFTDataset` for instruction tuning with input masking
  - `SFTDataModule` for data loading
  - `SFTTask` registered as `--task sft` in CLI
  - Tests for all SFT components
- DPO (Direct Preference Optimization):
  - `DPODataset` handling chosen/rejected pairs
  - `DPODataModule` for preference data loading
  - `DPOTask` with reference model management and DPO loss
  - Registered as `--task dpo` in CLI
  - Tests for all DPO components
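The DPO loss mentioned above follows the standard direct-preference formulation. A minimal per-pair sketch (the function name and scalar interface are illustrative, not the project's `DPOTask` API, which operates on batched tensors):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (policy log-ratio - reference log-ratio))."""
    margin = beta * ((policy_chosen_logp - policy_rejected_logp)
                     - (ref_chosen_logp - ref_rejected_logp))
    # -log(sigmoid(x)) == log(1 + exp(-x)); log1p keeps this stable for small margins
    return math.log1p(math.exp(-margin))
```

When the policy and reference agree, the margin is zero and the loss is log 2; the loss decreases as the policy prefers the chosen response more strongly than the reference does.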
- Continuous Batching Engine (Serving):
  - `src/llm/serving/engine.py` with `ContinuousBatchingEngine` class
  - Iteration-level scheduling via `Scheduler` and `SlotAllocator`
  - Pre-allocated KV cache pool for efficient memory management
  - Supports mixed prefill/decode batching with automatic padding
  - Clean API: requires `model` and `tokenizer` instances upfront
  - `src/llm/serving/scheduler.py` with FCFS scheduling logic
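The FCFS-plus-slot-pool idea can be illustrated with a toy scheduler. This is a hypothetical sketch of the concept, not the code in `src/llm/serving/scheduler.py`:

```python
from collections import deque

class SlotAllocator:
    """Hand out fixed KV-cache slots from a pool; None when the pool is exhausted."""
    def __init__(self, num_slots: int):
        self.free = deque(range(num_slots))

    def acquire(self):
        return self.free.popleft() if self.free else None

    def release(self, slot: int):
        self.free.append(slot)

class FCFSScheduler:
    """First-come-first-served: admit waiting requests while cache slots remain."""
    def __init__(self, allocator: SlotAllocator):
        self.allocator = allocator
        self.waiting: deque = deque()
        self.running: dict = {}  # request id -> slot

    def submit(self, request_id):
        self.waiting.append(request_id)

    def schedule(self):
        admitted = []
        while self.waiting:
            slot = self.allocator.acquire()
            if slot is None:          # pool exhausted: later requests keep waiting
                break
            rid = self.waiting.popleft()
            self.running[rid] = slot
            admitted.append(rid)
        return admitted

    def finish(self, request_id):
        self.allocator.release(self.running.pop(request_id))
```

With iteration-level scheduling, `schedule()` runs between decode steps, so a finished request frees its slot for the next waiting one without draining the whole batch.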
- LoRA (Low-Rank Adaptation):
  - `src/llm/core/lora.py` with `LoRALinear` class for parameter-efficient fine-tuning
  - `apply_lora()`, `merge_lora()`, `get_lora_parameters()` helper functions
  - Device/dtype handling for CUDA compatibility
  - 17 tests covering training and weight merging
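The weight-merging step is just arithmetic: the merged weight is W' = W + (alpha / r) * B @ A, after which inference needs no adapter matmul. A pure-Python sketch of that arithmetic (the real `merge_lora()` operates on PyTorch modules in place):

```python
def matmul(A, B):
    """Naive matrix product for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def merge_lora_weight(W, A, B, alpha: float, r: int):
    """W' = W + (alpha / r) * B @ A.
    W: (out, in) base weight; B: (out, r) and A: (r, in) low-rank adapters."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]
```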
- QLoRA (Quantized LoRA):
  - `src/llm/core/qlora.py` with `QLoRALinear` class
  - NF4 4-bit quantization for base weights (~4x memory reduction)
  - LoRA adapters remain in fp16/bf16 for training stability
  - `apply_qlora()` and `get_qlora_parameters()` helpers
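The 4-bit scheme stores each weight as an index into a 16-level codebook plus a per-block absmax scale. A simplified sketch of that mechanism (a uniform codebook is used here purely for illustration; real NF4 uses nonuniform levels derived from a normal distribution, and the project's implementation may block the weights differently):

```python
def quantize_4bit(values, codebook=None):
    """Absmax-normalize, then map each value to the nearest of 16 codebook levels."""
    if codebook is None:
        codebook = [-1 + 2 * i / 15 for i in range(16)]  # 16 evenly spaced levels in [-1, 1]
    absmax = max(abs(v) for v in values) or 1.0
    idxs = [min(range(16), key=lambda i: abs(v / absmax - codebook[i])) for v in values]
    return idxs, absmax

def dequantize_4bit(idxs, absmax, codebook=None):
    """Recover approximate weights: codebook lookup scaled back by absmax."""
    if codebook is None:
        codebook = [-1 + 2 * i / 15 for i in range(16)]
    return [codebook[i] * absmax for i in idxs]
```

Only the indices (4 bits each) and one scale per block are stored, which is where the ~4x memory reduction over fp16 comes from.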
- RoPE (Rotary Position Embedding):
  - `src/llm/core/rope.py` with `RotaryPositionEmbedding` class
  - Linear, dynamic, and NTK-aware scaling methods for extended context
  - `apply_rotary_pos_emb()`, `get_rope_scaling_factor()` utilities
  - 15 tests
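At its core, RoPE rotates each (even, odd) feature pair by a position-dependent angle; linear scaling stretches positions to extend context. A single-pair sketch of the math (the project's `apply_rotary_pos_emb()` does this vectorized over whole tensors):

```python
import math

def rotate_pair(x1: float, x2: float, pos: int, dim_idx: int,
                head_dim: int, base: float = 10000.0, scale: float = 1.0):
    """Rotate one feature pair by the RoPE angle for this position.
    scale > 1 compresses positions (linear scaling for extended context)."""
    theta = (pos / scale) * base ** (-2 * dim_idx / head_dim)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return x1 * cos_t - x2 * sin_t, x1 * sin_t + x2 * cos_t
```

Because this is a pure rotation, the vector norm is preserved and position 0 is the identity, two properties the unit tests for such a module typically check.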
- ALiBi (Attention with Linear Biases):
  - `src/llm/core/alibi.py` with `ALiBiPositionBias` class
  - `get_alibi_slopes()`, `build_alibi_bias()` functions
  - Cached bias computation for efficiency
  - 13 tests
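The standard ALiBi recipe gives head i (1-indexed) the slope 2^(-8i/n) and penalizes attention logits linearly with query-key distance. A sketch for power-of-two head counts (the project's `get_alibi_slopes()` presumably also handles non-power-of-two counts, which need an interpolation step omitted here):

```python
def get_alibi_slopes(num_heads: int):
    """Per-head slopes 2^(-8i/n) for i = 1..n (power-of-two head counts)."""
    return [2 ** (-8 * (i + 1) / num_heads) for i in range(num_heads)]

def alibi_bias(slope: float, query_pos: int, key_pos: int) -> float:
    """Causal ALiBi bias added to the attention logit: -slope * distance."""
    return -slope * (query_pos - key_pos)
```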
- Sliding Window Attention:
  - `window_size` parameter in `scaled_dot_product_attention`
  - Propagated through `MultiHeadAttention`, `TransformerBlock`, `DecoderModel`
  - Reduces memory for long sequences by limiting attention scope
  - 10 tests
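The window constraint composes with the causal mask: query i may attend key j only when j is in the past and within the last `window_size` positions. A boolean-mask sketch (the real kernel applies this as an additive mask on logits rather than materializing booleans):

```python
def sliding_window_mask(seq_len: int, window_size: int):
    """True where query i may attend key j: causal AND within the window."""
    return [[0 <= i - j < window_size for j in range(seq_len)]
            for i in range(seq_len)]
```

Each row has at most `window_size` True entries, which is what bounds attention memory for long sequences.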
- KV Cache Optimization:
  - `src/llm/core/kv_cache.py` with `KVCache` class for pre-allocated cache buffers
  - In-place updates during autoregressive generation (avoids O(n²) memory operations)
  - Integrated into `MHA`, `TransformerBlock`, `DecoderModel`
  - Factory method `KVCache.from_model_config()` for easy instantiation
  - Backward compatible: legacy `past_key_value` tuple format still works
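The optimization is allocate-once, write-in-place: instead of concatenating a new key/value tensor every step (which re-copies the whole history), the new token's K/V is written into the next slot of a fixed buffer. A list-based sketch of the mechanism (the real `KVCache` holds pre-allocated tensors per layer and head):

```python
class ToyKVCache:
    """Pre-allocated key/value buffers updated in place during generation."""
    def __init__(self, max_seq_len: int, head_dim: int):
        # Fixed-size buffers allocated once up front
        self.keys = [[0.0] * head_dim for _ in range(max_seq_len)]
        self.values = [[0.0] * head_dim for _ in range(max_seq_len)]
        self.length = 0

    def update(self, key, value):
        """Write the new token's K/V into the next slot; no reallocation or copy."""
        self.keys[self.length][:] = key
        self.values[self.length][:] = value
        self.length += 1

    def view(self):
        """Return only the filled prefix for attention."""
        return self.keys[:self.length], self.values[:self.length]
```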
- E2E Testing Infrastructure:
  - `tests/e2e/` directory with comprehensive pipeline tests
  - `test_training.py`, `test_sft.py`, `test_dpo.py`
  - `test_gradient_accumulation.py`, `test_resume_training.py`
  - Advanced inference and callback tests
- Documentation:
  - `notebooks/quick_start.ipynb` interactive tutorial
  - Covers model building, training, inference, and advanced features
- SDPA Refactoring:
  - Consolidated `scaled_dot_product_attention` wrapper into `src/llm/core/attn/sdpa.py`
  - Refactored `MultiHeadAttention` and `MultiLatentAttention` to use common `sdpa` wrapper
  - Archived custom implementation to `_learning/03_lab/experiments/custom_sdpa.py`
- Test Suite Refactoring:
  - Organized test files into subdirectories (`tests/training/`, `tests/inference/`, etc.)
  - Converted to functional testing style (real components over mocks)
  - Added shared fixtures in `tests/conftest.py`
  - Test count: 385 → 432
- TrainingEngine:
  - Support for dictionary batches in training/validation loops
  - Gradient accumulation implementation
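Gradient accumulation sums gradients over several micro-batches and steps the optimizer once per window, emulating a larger effective batch. A scalar sketch of the schedule (the real engine works with PyTorch autograd and an optimizer object; the function below is illustrative):

```python
def train_with_accumulation(micro_batch_grads, accum_steps: int, lr: float, w: float = 0.0):
    """Average gradients over `accum_steps` micro-batches before each optimizer step."""
    accumulated = 0.0
    for step, grad in enumerate(micro_batch_grads, start=1):
        accumulated += grad
        if step % accum_steps == 0:
            w -= lr * (accumulated / accum_steps)  # one optimizer step per window
            accumulated = 0.0
    return w
```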
- DPO Reference Model:
  - Use model reconstruction instead of `deepcopy` for ref_model creation
- Documentation:
  - Added `docs/README.md` as documentation entry point
  - Added MkDocs Material configuration (`mkdocs.yml`) for documentation site
  - Added GitHub Actions workflow for automatic GitHub Pages deployment
  - Added `guide-finetuning.md` (LoRA/QLoRA) and `guide-inference.md` (KVCache/GQA/Continuous Batching)
  - Enhanced `architecture.md` with detailed component diagrams and data flow analysis
  - Updated ROADMAP Phase 10.2 (Continuous Batching complete)
- Gradient Checkpointing:
  - Memory-efficient training via `gradient_checkpointing` parameter in `DecoderModel`
  - `enable_gradient_checkpointing()` / `disable_gradient_checkpointing()` methods
  - Automatic incompatibility check with `use_cache=True`
- E2E Pipeline Automation:
  - `scripts/e2e_pipeline.py` for automated Train → Evaluate → Inference workflow
  - `src/llm/utils/e2e.py` with reusable E2E core functions (`E2EConfig`, `E2EResult`, `run_e2e_pipeline`)
  - Rich progress UI and configurable CLI options
- OpenAI-Compatible Chat API (`/v1/chat/completions`):
  - Compatible with official OpenAI Python SDK
  - Streaming and non-streaming chat completions
  - Bearer token authentication support
  - Multi-turn conversation handling
  - 8 new test cases for compatibility layer
- Batch Inference:
  - `batch_generate` function in `inference.py` with left-padding and batched forward pass
  - `BatchGenerationRequest` / `BatchGenerationResponse` schemas
  - `/batch_generate` API endpoint
  - 3 tests for batch inference (basic, single, empty)
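Left-padding aligns every sequence's newest token to the last position, so one batched forward pass can decode all rows in lockstep. A minimal sketch of the padding step (helper name and mask convention are illustrative; the project's `batch_generate` works on tensors):

```python
def left_pad(batch, pad_id: int):
    """Left-pad token-id sequences to a common length.
    Returns padded ids and a 0/1 attention mask (0 = padding)."""
    width = max(len(seq) for seq in batch)
    padded = [[pad_id] * (width - len(seq)) + list(seq) for seq in batch]
    mask = [[0] * (width - len(seq)) + [1] * len(seq) for seq in batch]
    return padded, mask
```

Right-padding would instead scatter each row's last real token to a different column, forcing per-row indexing at every decode step.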
- Request Queue and Concurrency Control:
  - `max_concurrent_requests` and `request_timeout` in `ServingConfig`
  - `asyncio.Semaphore` for concurrency limiting
  - `asyncio.timeout` for request timeout handling (504 response)
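The concurrency limit is the semaphore pattern: each handler acquires a slot before touching the model, so at most `max_concurrent_requests` run at once. A self-contained sketch that measures the effect (function names are illustrative; the project additionally wraps handlers in `asyncio.timeout`, which requires Python 3.11+ and is omitted here):

```python
import asyncio

async def _measure_peak_concurrency(num_requests: int, limit: int) -> int:
    """Push fake requests through a semaphore and report the peak in-flight count."""
    semaphore = asyncio.Semaphore(limit)
    active = 0
    peak = 0

    async def handle_request():
        nonlocal active, peak
        async with semaphore:              # blocks once `limit` requests are in flight
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)      # stand-in for model inference
            active -= 1

    await asyncio.gather(*(handle_request() for _ in range(num_requests)))
    return peak

def peak_concurrency(num_requests: int, limit: int) -> int:
    return asyncio.run(_measure_peak_concurrency(num_requests, limit))
```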
- CLI Entry Points:
  - `llm-train` command for training models
  - `llm-serve` command for starting inference server
- Testing Infrastructure:
  - Pytest markers using decorators: `quick`, `slow`, `heavy`, `e2e`
  - MoE integration tests (6 tests for expert routing, gradient flow)
  - E2E pipeline tests (full workflow, streaming consistency)
  - Gradient checkpointing tests (8 tests)
  - Total test count: 296 → 337
- Examples Directory:
  - `inference_demo.py` for basic text generation
  - `openai_client_demo.py` for OpenAI SDK usage
- Documentation:
  - `scripts/README.md` documenting all available scripts
  - HFTokenizer example in `usage.md`
  - Updated root `README.md` with links to Examples and Scripts
- Makefile Reorganization:
  - `make test` now runs all tests by default
  - `make test-fast` for daily development (excludes heavy/e2e)
  - `make test-quick` for rapid iteration (~6s)
  - `make test-cov` for CI with coverage and allure reports
  - Removed redundant `test-all` and `test-integration`
- CLI Standardization:
  - CLI parameters changed from snake_case to kebab-case (`--file-path`, `--batch-size`)
  - Replaced `typer` with `typer-slim[standard]` for reduced dependencies
- Code Quality Improvements:
  - Translated Chinese docstrings to English in serving module
  - Removed ~75 lines of redundant comments
  - Simplified section comments while preserving algorithm clarity
- Documentation Refactoring:
  - Eliminated redundancy between README, usage.md, and development.md
  - Clear document responsibility separation
  - Updated all docs to use new CLI commands
  - Enhanced package metadata (keywords, classifiers)
- Module Exports:
  - Enhanced `llm/__init__.py` with public API exports (`DecoderModel`, `generate`, etc.)
  - Enhanced `llm.serving` module exports (`LLMEngine`, `ServingConfig`, OpenAI schemas)
- Removed obsolete TODO comment in `engine.py`
- Removed duplicate `num_kv_heads` field in `ModelConfig`
- Fixed MD051/link-fragments in `tutorial-cpu-llm.md` and `faq.md`
- Fixed `train.py` task registration for `lm` task
- Inference Serving:
  - Production-ready REST API with FastAPI
  - Streaming support via Server-Sent Events (SSE)
  - Advanced sampling strategies (nucleus sampling/top-p, repetition penalty)
  - Prometheus metrics endpoint for monitoring
  - API key authentication (`X-API-Key` header)
  - Structured logging with `python-json-logger`
  - Real PyTorch model weights loading from checkpoint files
  - Pickled tokenizer object loading support
- Component Registry:
  - Automatic component registration system (`ComponentRegistry`)
  - Core components (MHA, MLP, MoE) auto-registered via side-effect imports
  - Prevents "component not found" errors in simplified scripts
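Side-effect registration means merely importing a module populates the registry via a class decorator. A minimal sketch of the pattern (method names and the stub class are illustrative, not the project's exact `ComponentRegistry` API):

```python
class ComponentRegistry:
    """Name -> class mapping populated at import time via a decorator."""
    _components: dict = {}

    @classmethod
    def register(cls, name: str):
        def decorator(component_cls):
            cls._components[name] = component_cls
            return component_cls
        return decorator

    @classmethod
    def get(cls, name: str):
        if name not in cls._components:
            raise KeyError(f"component not found: {name}")
        return cls._components[name]

# Importing the defining module is enough to register the component
@ComponentRegistry.register("mha")
class MultiHeadAttentionStub:
    pass
```

This is why `import llm` alone suffices: the package `__init__` imports the component modules, and each decorator runs as a side effect.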
- Data Abstraction:
  - Formalized `BaseTokenizer` protocol
  - `BaseDataModule` abstraction for flexible data handling
  - Environment variable configuration support (e.g., `LLM_TRAINING__EPOCHS`)
- Testing & CLI:
  - `--num-samples` flag in `train.py` for rapid regression testing
  - Scheduler edge case tests (`test_scheduler_edge_cases.py`)
  - Validation logging tests (`test_engine_logging.py`)
  - Component registry tests (`test_init.py`)
  - Model loading verification tests
  - Auto-device detection in training scripts (prioritizes CUDA)
- Documentation:
  - Comprehensive usage guide (`docs/usage.md`)
  - Architecture documentation (`docs/architecture.md`)
  - Engineering documentation (ADRs, PR templates, FAQ)
  - VS Code configuration and extensions
- Architecture Modernization:
  - Migrated to Pydantic v2 (`BaseSettings`, `BaseModel`) for configuration
  - Fully typed and validated configuration system
  - CLI migration from `argparse` to `typer` for better UX
- Naming Standardization:
  - Unified `ffn_hidden_size` → `intermediate_size` across codebase
  - Standardized input parameter `x` → `hidden_states` in forward methods
  - Applied to `MLP`, `LayerNorm`, `RMSNorm`, `DecoderModel`, `TransformerBlock`
  - Updated all 309 tests to reflect API changes
- Code Quality:
  - Standardized punctuation in documentation (full-width → half-width)
  - Improved type hints and documentation comments
  - Refactored `TransformerBlock.forward` for clarity
- Core Bugs:
  - `CosineAnnealingLR` `T_max` calculation when `epochs == warmup_epochs` (ZeroDivisionError)
  - `TrainingEngine` validation logging crash when `gradient_norms` is empty (IndexError)
  - PAD token generation issue in inference (logits masking)
  - `SyntheticDataModule` `prefetch_factor` handling with `num_workers=0`
  - `TransformerBlock` shared norm instance bug (independent `norm1`/`norm2`)
  - Scheduler/optimizer step order warnings in tests
  - `PositionalEncoding` support for `start_pos` in incremental generation
  - MLP SwiGLU operation order for numerical consistency
  - Prompt truncation respecting `max_seq_len` with new tokens
  - Auto AMP dtype resolution for CPU-only environments
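The `T_max` bug arises because the cosine schedule covers `epochs - warmup_epochs` steps, which is zero when warmup spans all epochs. A sketch of the shape of the fix (the helper name is illustrative; the actual guard in the trainer may differ):

```python
def cosine_t_max(epochs: int, warmup_epochs: int) -> int:
    """T_max for CosineAnnealingLR after warmup, clamped to at least 1
    so epochs == warmup_epochs no longer divides by zero."""
    return max(1, epochs - warmup_epochs)
```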
- Registry & Imports:
  - Package auto-registration via `import llm`
  - Component not found errors in simplified execution
- Modern Architecture Features:
  - Grouped Query Attention (GQA) for balanced performance and memory efficiency
  - SwiGLU activation function in MLP layers
  - Unified QKV projection optimization for improved memory layout and throughput
  - RMSNorm support as alternative normalization layer
- Tokenization & Training:
  - BPETokenizer for production-ready subword tokenization
  - LanguageModelingTask for language model training
  - Automatic BF16/FP16 mixed precision detection and support
  - Robust NaN loss handling
- Inference Capabilities:
  - KV Cache support in MHA, TransformerBlock, and DecoderModel
  - Top-k and Top-p sampling strategies
  - Greedy search decoding (temperature=0)
  - Dynamic sequence length support
  - Simple autoregressive generation loop
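Top-k keeps the k most probable tokens; top-p (nucleus) keeps the smallest set whose cumulative probability reaches p; both renormalize before sampling. A list-based sketch of the two filters (the project's versions operate on logits tensors; ties at the top-k threshold are all kept here):

```python
def top_k_filter(probs, k: int):
    """Keep the k most probable tokens, zero out the rest, renormalize."""
    threshold = sorted(probs, reverse=True)[k - 1]
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p: float):
    """Nucleus sampling: keep the smallest set of tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    filtered = [probs[i] if i in kept else 0.0 for i in range(len(probs))]
    total = sum(filtered)
    return [x / total for x in filtered]
```

Greedy decoding is the degenerate case `top_k_filter(probs, 1)`, matching temperature=0 behavior.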
- Testing & Quality:
  - 262 comprehensive unit test cases covering all core functionality
  - Functional tests for causal masking, KV cache consistency, architecture properties
  - Convergence tests for training validation
  - Mock-free test design using real components
- Documentation:
  - Comprehensive ROADMAP.md (405 lines) with 15 development stages
  - Priority levels (P1-P4), timelines, and success metrics
  - Detailed training framework documentation (8 comprehensive guides)
  - CPU-friendly LLM tutorial and development guide
  - FAQ document covering core topics
  - ADR (Architecture Decision Records) system with 4 initial records
  - PR template for standardized contributions
- Architecture Optimization:
  - Refactored DecoderModel with configurable components
  - Optimized padding mask and KV cache handling
  - Improved GradScaler usage for bfloat16
- Training Enhancements:
  - Enhanced TrainingEngine with improved callback system
  - Performance monitoring and logging improvements
  - Auto AMP dtype resolution for CPU-only environments
- Code Quality:
  - Enhanced Ruff linting rules (SIM, RUF, PTH for pathlib)
  - PEP 561 compliance with py.typed marker
  - Standardized punctuation across documentation
  - Project structure improvements for modularity
- Documentation:
  - Updated Quick Start example from regression to lm task
  - Enhanced feature descriptions with technical highlights
  - Better cross-references and examples throughout
- Core Issues:
  - All 262 test regressions resolved
  - `PositionalEncoding` support for `start_pos` in incremental generation
  - MLP SwiGLU operation order for numerical consistency
  - Prompt truncation respecting `max_seq_len` with new tokens
  - Device mismatch in MLP when norm instance provided
  - Auto AMP dtype test failures on CUDA environments
- Quality & Stability:
  - Type checking issues across the codebase
  - Memory management in distributed training
  - Edge cases in attention masking and positional encoding
  - Device comparisons robustness (comparing device.type)
  - Failed runs on CPU-only environments
- Initial project setup with modern Python tooling (uv, hatchling)
- Basic Decoder-only Transformer architecture
- Multi-Head Attention (MHA) implementation
- Standard MLP with GELU activation
- SimpleCharacterTokenizer for basic experimentation
- Positional encoding (sinusoidal and learned)
- TrainingEngine with Distributed Data Parallel (DDP) support
- Automatic Mixed Precision (AMP) training
- Basic Mixture of Experts (MoE) implementation
- Core data loading and processing infrastructure
- BaseDataModule abstraction for flexible data handling
- pytest-based testing infrastructure
- CI/CD pipeline with GitHub Actions
- Code quality tools (ruff for linting/formatting, mypy for type checking)
- Pre-commit hooks for code quality enforcement
- Docker support for containerized development
- Comprehensive README and contributing guidelines