# AI/ML Expert

## Core Framework Guidelines

### PyTorch

When reviewing or writing PyTorch code, apply these guidelines:

- Use `torch.nn.Module` for all model definitions; avoid raw function-based models
- Move tensors and models to the correct device explicitly: `model.to(device)`, `tensor.to(device)`
- Switch between `model.train()` and `model.eval()` modes appropriately
- Reset gradients with `optimizer.zero_grad()` at the top of the training loop
- Use `torch.no_grad()` or `@torch.inference_mode()` for all inference code
- Pin memory (`pin_memory=True`) and use multiple workers in `DataLoader` for GPU training
- Use `torch.compile()` (PyTorch 2.x) for production inference speedups
- Prefer `F.cross_entropy` over manual softmax + `NLLLoss` (numerically stable)

### TensorFlow / Keras

When reviewing or writing TensorFlow code, apply these guidelines:

- Use the Keras functional API or subclassing API; avoid `Sequential` for complex models
- Prefer `tf.data.Dataset` pipelines over manual batching for scalability
- Use `tf.function` for graph execution on performance-critical paths
- Apply mixed precision training: `tf.keras.mixed_precision.set_global_policy('mixed_float16')`
- Use `tf.saved_model` for portable model export; avoid pickling

### Hugging Face Transformers

When reviewing or writing Hugging Face code, apply these guidelines:

- Always use the tokenizer associated with the model checkpoint
- Set `padding=True` and `truncation=True` when tokenizing batches
- Use `AutoModel`, `AutoTokenizer`, and `AutoConfig` for checkpoint portability
- Apply `model.gradient_checkpointing_enable()` to reduce memory for large models
- Use the `Trainer` API for standard fine-tuning; write custom loops only when `Trainer` is insufficient
- Cache models with the `TRANSFORMERS_CACHE` environment variable in CI/CD pipelines

### scikit-learn

When reviewing or writing scikit-learn code, apply these guidelines:

- Use `Pipeline` to chain preprocessing and model steps; it prevents data leakage
- Use `StratifiedKFold` for classification tasks with class imbalance
- Prefer `GridSearchCV` or `RandomizedSearchCV` for hyperparameter tuning
- Call `.fit()` only on training data; transform test data with the fitted transformer
- Serialize models with `joblib.dump` / `joblib.load` (faster than pickle for large arrays)

## LLM Integration Patterns

### Prompt Engineering

- Structure prompts with a clear system message, context, and user instruction
- Use few-shot examples in the system prompt for consistent output formatting
- Apply chain-of-thought prompting (`"Think step by step..."`) for complex reasoning tasks
- Set `temperature=0` for deterministic, fact-based outputs; increase it for creative tasks
- Manage token budgets explicitly: estimate prompt tokens before sending
- Implement output parsing with structured formats (JSON mode, XML tags)

### RAG Pipelines
```python
# Standard RAG pipeline components
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS  # or Chroma, Pinecone, Weaviate
from langchain.chains import RetrievalQA

# 1. Embed and index documents
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
vectorstore = FAISS.from_documents(documents, embeddings)

# 2. Retrieve relevant chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# 3. Generate with retrieved context
chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
```

RAG best practices:

- Chunk documents at natural boundaries (paragraphs, sections), not fixed character counts
- Use hybrid retrieval: combine dense embeddings with sparse BM25 for better recall
- Implement semantic caching for repeated queries to reduce latency and cost
- Validate retrieved context relevance before passing it to the LLM
- Store metadata alongside embeddings for filtering (date, source, author)

### LangChain / LangGraph

- Use LCEL (LangChain Expression Language) for composable chains
- Apply `RunnableParallel` for concurrent retrieval steps
- Use LangGraph for stateful multi-agent workflows with cycles
- Implement retry logic with `RunnableRetry` for unreliable external calls
- Trace and evaluate chains with LangSmith in development

## Training Loop Standards
```python
# Standard PyTorch training loop with best practices
for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        optimizer.zero_grad()
        inputs, labels = batch["input_ids"].to(device), batch["labels"].to(device)
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()
        scheduler.step()

    # Validation loop
    model.eval()
    with torch.no_grad():
        for batch in val_dataloader:
            ...  # evaluate
```

Key standards:

- Proper train/validation/test splits: 80/10/10, or stratified for imbalanced datasets
- Gradient clipping (`max_norm=1.0`) for stability in Transformer training
- Learning rate scheduling: cosine annealing with warmup for Transformers
- Early stopping based on validation loss, not training loss
- Checkpoint the best model by validation metric, not the final epoch

## Fine-Tuning Standards

### Full Fine-Tuning

- Reduce the learning rate 10-100x compared to training from scratch
- Freeze early layers; fine-tune upper layers and the task head first
- Use discriminative learning rates: lower LR for frozen layers, higher for new layers
- Apply label smoothing (`smoothing=0.1`) to reduce overconfidence

### Parameter-Efficient Fine-Tuning (PEFT)

```python
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # LoRA rank
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()        # verify < 1% parameters trainable
```
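To make the `r` and `lora_alpha` knobs concrete: LoRA learns a low-rank correction to a frozen weight, `h = Wx + (alpha/r) * B A x`. A NumPy sketch of that arithmetic (the variable names and tiny shapes are illustrative, not from the PEFT library, which attaches the factors to the projections named in `target_modules`):

```python
import numpy as np

d_out, d_in, r, lora_alpha = 32, 32, 4, 8  # toy shapes for illustration

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable low-rank factor, small init
B = np.zeros((d_out, r))                 # trainable low-rank factor, zero init

def lora_forward(x):
    # h = W x + (alpha / r) * B A x : frozen base path plus scaled low-rank update
    return W @ x + (lora_alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# Zero-initialized B makes LoRA an exact no-op at the start of fine-tuning
starts_as_noop = np.allclose(lora_forward(x), W @ x)

# Trainable parameters: r * (d_in + d_out) instead of d_in * d_out.
# At Llama scale (d = 4096, r = 16) that is ~131K vs ~16.8M per adapted
# matrix, which is where the "< 1% trainable" check above comes from.
trainable_fraction = r * (d_in + d_out) / (d_in * d_out)
```

This also shows why higher `r` buys capacity at a linear memory cost, and why merging `W + (alpha/r) * B @ A` into a single matrix before serving removes the extra inference path entirely.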
PEFT guidelines:

- Use LoRA rank `r=8` to `r=64`; higher rank = more capacity, more memory
- Use QLoRA (4-bit quantization + LoRA) for fine-tuning 7B+ models on consumer GPUs
- Merge adapter weights before serving to eliminate inference overhead
- Prefer adapter-based methods over full fine-tuning for limited data (< 10K examples)

## MLOps and Experiment Tracking

### MLflow

```python
import mlflow

with mlflow.start_run():
    mlflow.log_params({"learning_rate": lr, "batch_size": bs, "epochs": epochs})
    mlflow.log_metrics({"train_loss": loss, "val_accuracy": acc}, step=epoch)
    mlflow.pytorch.log_model(model, "model")
```

### Weights & Biases

```python
import wandb

wandb.init(project="my-project", config={"lr": 1e-4, "epochs": 10})
wandb.log({"train_loss": loss, "val_f1": f1_score})
wandb.finish()
```

MLOps standards:

- Log every hyperparameter and dataset version before training starts
- Track system metrics (GPU utilization, memory, throughput) alongside model metrics
- Version datasets with DVC or Delta Lake; never overwrite raw data
- Use reproducible seeds: `torch.manual_seed(42)`, `np.random.seed(42)`, `random.seed(42)`
- Register production models in a model registry with stage gates (Staging → Production)

## Model Evaluation Standards

### Metrics by Task Type

| Task | Primary Metrics | Secondary Metrics |
|------|-----------------|-------------------|
| Binary Classification | AUC-ROC, F1, Precision/Recall | Calibration (Brier Score) |
| Multi-class | Macro F1, Weighted F1, Cohen's Kappa | Confusion Matrix |
| Regression | RMSE, MAE, R² | Residual Analysis |
| NLP Generation | BLEU, ROUGE, BERTScore | Human Evaluation |
| Ranking/Retrieval | NDCG@k, MRR, MAP | Hit Rate@k |
| LLM Evaluation | LLM-as-judge, exact match, pass@k | Hallucination Rate |

### Evaluation Best Practices

- Never tune hyperparameters on the test set; use a held-out validation set
- Report confidence intervals (bootstrap or cross-validation) for all metrics
- Disaggregate metrics by subgroup for fairness analysis
- Use statistical significance tests (McNemar, paired t-test) when comparing models
- Establish a simple baseline before reporting model results

## Production ML Systems

### Model Deployment

- Export to ONNX for cross-platform inference: `torch.onnx.export(model, ...)`
- Use TorchServe, Triton Inference Server, or BentoML for serving
- Apply quantization for CPU deployment: `torch.quantization.quantize_dynamic(model, ...)`
- Set up batching with a maximum batch size and timeout for throughput vs. latency tradeoffs
- Use model warming (pre-load and run a dummy inference) to eliminate cold-start latency

### Monitoring and Drift Detection
```python
# Example: data drift detection with Evidently
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=production_df)
report.save_html("drift_report.html")
```

Monitoring standards:

- Track feature distribution drift (KS test, PSI) on a daily schedule
- Alert on prediction distribution shift (concept drift)
- Log and sample model inputs/outputs for downstream evaluation
- Implement shadow mode (run the new model alongside production and compare outputs)
- Define retraining triggers based on drift thresholds, not fixed schedules

## Data Preprocessing Standards
```python
# Proper train/test split to avoid leakage
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42,
    stratify=y,  # stratify for classification
)

# Fit the scaler ONLY on training data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only, never fit_transform
```

Standards:

- Separate preprocessing pipeline per data modality (text, image, tabular)
- Validate schema and types before entering the pipeline
- Handle missing values with domain-aware strategies (median, mode, forward-fill)
- Detect and document outliers; do not silently remove them
- Apply augmentation only to training data, never validation or test data

## Iron Laws

- ALWAYS fix random seeds and log all hyperparameters before training — non-reproducible experiments cannot be shared, audited, or debugged; use `torch.manual_seed(42)`, `np.random.seed(42)`, `random.seed(42)` and log via MLflow/W&B.
- NEVER fit preprocessing transformers on test data — fit only on training data, then `.transform()` the test data; fitting on test causes data leakage and inflated performance estimates.
- ALWAYS evaluate with multiple metrics aligned to business goals — never report accuracy alone on imbalanced datasets; use F1, the precision-recall curve, and ROC-AUC at minimum.
- NEVER tune hyperparameters on the test set — use a held-out validation set for tuning; the test set is a one-time final evaluation only.
- ALWAYS establish a simple baseline before reporting model results — a heuristic or random baseline is mandatory; without it, model quality cannot be assessed.
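The first iron law can be enforced with a single helper called at the top of every entry point. A sketch (the `set_seed` name is illustrative; the PyTorch calls are guarded so the helper still runs in projects where torch is not installed):

```python
import os
import random

import numpy as np


def set_seed(seed: int = 42) -> None:
    """Fix every common source of randomness for reproducible experiments."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:  # guarded: only applies when PyTorch is installed
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


set_seed(42)
a = (random.random(), np.random.rand())
set_seed(42)
b = (random.random(), np.random.rand())
# Identical seeds reproduce identical draws
assert a == b
```

Log the seed itself as a hyperparameter (via MLflow or W&B) so any run can be reproduced from its metadata alone.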
## Anti-Patterns

| Anti-Pattern | Problem | Fix |
|--------------|---------|-----|
| Ignoring class imbalance | Model biased to majority class | Stratified sampling, class weights, SMOTE |
| No validation set | Overfitting undetected | Hold out 10-20% for validation |
| Optimizing a single metric | Missing failure modes | Multiple metrics (precision, recall, F1, AUC) |
| No baseline comparison | Cannot assess model quality | Establish heuristic baseline before ML |
| Accuracy on imbalanced data | Misleading performance estimate | Use F1, precision-recall curve, ROC-AUC |
| Data leakage (test in train) | Inflated performance estimates | Fit on train only; transform test with fitted obj |
| No error analysis | Cannot improve strategically | Analyze failure cases by error type |
| Training without checkpoints | Lost progress on failure | Save best model by validation metric |
| Mutable global random state | Non-reproducible experiments | Fix all seeds; log in experiment metadata |
| Embedding model in application | Cannot update model independently | Serve model via API (REST, gRPC) |
| No latency budget | Inference too slow for production | Profile and set SLO before deployment |

Training a Transformer classifier:

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3
)

def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics,  # user-supplied function returning {"f1": ...}
)
trainer.train()
```

Minimal RAG pipeline:

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
qa = RetrievalQA.from_chain_type(ChatOpenAI(model="gpt-4o"), retriever=retriever)
answer = qa.run("What is the refund policy?")
```

## Assigned Agents

This skill is used by:

- developer — Implements ML models, data pipelines, and LLM integrations
- researcher — Investigates novel architectures and evaluates research papers
- architect — Designs ML system architecture and deployment topology
- security-architect — Reviews data privacy, model security, and inference safety