Protocol: Data Preprocessing¶
Overview¶
This protocol describes the standard data preprocessing pipeline used in the lab.
Prerequisites¶
Environment Setup¶
# Create conda environment
conda create -n lab-env python=3.9
conda activate lab-env
pip install -r requirements.txt
Required Packages¶
Input¶
- Raw data format: [Format description]
- Location: [Data location]
- Expected file structure: [Describe structure]
Procedure¶
Step 1: Data Loading¶
import pandas as pd
# Load data
data = pd.read_csv('input.csv')
# Verify loading
print(f"Data shape: {data.shape}")
print(f"Columns: {data.columns.tolist()}")
Step 2: Data Cleaning¶
# Remove missing values
data = data.dropna()
# Remove duplicates
data = data.drop_duplicates()
# Check for outliers
# [Add outlier detection code]
Step 3: Data Transformation¶
# Normalize data
data['normalized'] = (data['value'] - data['value'].mean()) / data['value'].std()
# Log transform if needed
data['log_value'] = np.log(data['value'] + 1)
Step 4: Quality Control¶
# Check data quality
print(f"Missing values: {data.isnull().sum()}")
print(f"Data range: {data.describe()}")
Output¶
- Processed data format: [Format description]
- Location: [Output location]
- File naming: [Naming convention]
Code Standards¶
- Follow PEP 8 for Python code
- Document all functions
- Use version control
- Write unit tests when possible
Troubleshooting¶
| Issue | Solution |
|---|---|
| Memory errors with large files | Use chunking: pd.read_csv('file.csv', chunksize=10000) |
| Encoding errors | Specify encoding: pd.read_csv('file.csv', encoding='utf-8') |
| Missing data | Review data collection protocol, consider imputation |
Related Protocols¶
Last updated: December 18, 2025