Introduction
What is PyImport?
PyImport is a powerful Python command-line tool for importing CSV data into MongoDB. Unlike MongoDB’s native mongoimport, PyImport focuses on:
Handling “dirty” data gracefully - Automatic type conversion with fallback to strings on errors
Automatic field type detection - Infers types from CSV data and generates field files
Multiple execution strategies - Sync, async, multi-process, and threaded imports
Parallel processing - Split large files and import in parallel for maximum throughput
Restart capability - Resume failed imports from where they left off
Flexible type conversion - Support for dates, timestamps, ISO dates, and custom formats
Key Features
Automatic Type Detection
PyImport can analyze your CSV file and automatically generate a field file (.tff) that defines the type of each column:
pyimport --genfieldfile data.csv
This creates data.tff with inferred types for each field.
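To illustrate the idea behind type inference, here is a minimal sketch, not PyImport's actual detection code. It tries progressively looser parsers on every non-empty value in a column and keeps the first one that accepts them all. The function name `infer_type` and the type labels (`int`, `float`, `isodate`, `str`) are assumptions chosen for this example and may not match what a generated field file contains.

```python
from datetime import datetime


def infer_type(values):
    """Return the first type label whose parser accepts every non-empty value."""
    # Ordered from strictest to loosest; labels are illustrative only.
    candidates = [
        ("int", int),
        ("float", float),
        ("isodate", datetime.fromisoformat),
    ]
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "str"
    for name, parse in candidates:
        try:
            for v in non_empty:
                parse(v)
        except ValueError:
            continue  # at least one value did not fit; try the next type
        return name
    return "str"  # nothing stricter fits, so treat the column as text


columns = {
    "id": ["1", "2", "3"],
    "price": ["1.5", "2", "3.25"],
    "created": ["2024-01-15", "2024-02-01", "2024-03-09"],
    "note": ["ok", "12", ""],
}
inferred = {name: infer_type(vals) for name, vals in columns.items()}
print(inferred)
```

A column is only given a stricter type when every sampled value parses, which is why the mixed `note` column ends up as a string.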
Forgiving Type Conversion
If a field value cannot be converted to its specified type, PyImport falls back to storing it as a string rather than failing the entire import. This handles messy real-world data gracefully.
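The fallback behaviour described above can be sketched in a few lines. This is an illustration of the technique under stated assumptions, not PyImport's implementation: `convert_or_keep` is a hypothetical helper, and the per-column converters are plain Python built-ins.

```python
def convert_or_keep(raw, converter):
    """Apply converter to raw; on failure, keep the original string."""
    try:
        return converter(raw)
    except (ValueError, TypeError):
        return raw


# Hypothetical row and column types, for illustration only.
row = {"qty": "12", "price": "n/a"}
converters = {"qty": int, "price": float}

doc = {key: convert_or_keep(value, converters[key]) for key, value in row.items()}
print(doc)
```

The bad `price` value survives as the string `"n/a"` while `qty` is stored as an integer, so one dirty cell does not abort the row or the import.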
Multiple Import Strategies
Synchronous (default): Single-threaded, straightforward imports
Async (--asyncpro): Event-loop based async imports using Motor driver
Multi-process (--multi): Parallel import using multiple CPU cores
Threaded (--threads): Thread-based parallel import
Performance Optimizations
Recent performance improvements include:
Pre-compiled type converters (15-25% faster)
Optimized field validation (5-10% faster)
Fast ISO date parsing (100x faster than generic date parsing)
Expected overall improvement: 20-35%
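As a rough illustration of what "pre-compiled" converters can mean, the sketch below resolves each column's converter once, up front, instead of looking it up for every cell. It also uses the standard library's `datetime.fromisoformat` for ISO dates rather than trying several formats in turn. The names `CONVERTERS` and `compile_row_converter` and the type labels are assumptions for this example. PyImport's internals may differ.

```python
from datetime import datetime

# Illustrative mapping from type label to parser (labels are assumptions).
CONVERTERS = {
    "int": int,
    "float": float,
    "isodate": datetime.fromisoformat,
    "str": str,
}


def compile_row_converter(field_types):
    """Resolve the converter for each column once and return a row function."""
    funcs = tuple(CONVERTERS[t] for t in field_types)

    def convert(row):
        # No dictionary lookups in the per-row path, just direct calls.
        return [func(value) for func, value in zip(funcs, row)]

    return convert


convert = compile_row_converter(["int", "isodate", "str"])
converted = convert(["7", "2024-01-15T10:30:00", "widget"])
print(converted)
```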
Typical throughput:
Sync: ~24,000-32,000 docs/sec
Async: ~30,000-40,000 docs/sec
Multi-process: ~50,000+ docs/sec
File Splitting
Large CSV files can be automatically split into smaller chunks for parallel processing:
pyimport --splitfile --autosplit 10 --multi --poolsize 4 largefile.csv
This splits the file into 10 chunks and processes them with 4 parallel workers.
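The split-then-fan-out pattern can be sketched as follows. This is a simplified model under stated assumptions: it splits an in-memory list rather than a file on disk, `import_chunk` is a stand-in that only counts rows, and a thread pool is used for portability where a multi-process import would use worker processes (for example `concurrent.futures.ProcessPoolExecutor`).

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, n_chunks):
    """Split lines into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(lines), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional line each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks


def import_chunk(chunk):
    """Hypothetical worker: a real one would write the chunk to the database."""
    return len(chunk)


lines = [f"row-{i}" for i in range(25)]
chunks = split_into_chunks(lines, 10)  # 10 chunks, like --autosplit 10

with ThreadPoolExecutor(max_workers=4) as pool:  # 4 workers, like --poolsize 4
    counts = list(pool.map(import_chunk, chunks))

print(counts, sum(counts))
```

Using more chunks than workers lets a worker that finishes early pick up another chunk, which helps when chunks take uneven time to import.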
When to Use PyImport
PyImport is ideal when you need to:
Import CSV data with inconsistent types or dirty data
Automatically detect and handle various date formats
Import large files quickly using parallel processing
Resume failed imports without starting over
Add metadata (timestamps, filenames, line numbers) to imported documents
Import from URLs or local files with the same command
Quick Example
Basic import:
# Generate field file
pyimport --genfieldfile mydata.csv
# Import with generated field file
pyimport --database mydb --collection mycol mydata.csv
Fast parallel import:
pyimport --multi --splitfile --autosplit 8 --poolsize 4 \
--database mydb --collection mycol largefile.csv
Architecture Overview
PyImport follows a clean architecture:
Field Files (.tff): TOML files defining column types and formats
CSV Reader: Reads and type-converts CSV data per field definitions
Enricher: Optionally adds metadata (timestamps, filenames, line numbers)
Database Writer: Batches and writes documents to MongoDB (or PostgreSQL)
Import Commands: Orchestrates the import process using different strategies
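The flow through these components can be modelled as a chain of generators. The sketch below is a toy pipeline, not PyImport's code: the function names, the metadata keys (`filename`, `line_no`, `imported_at`), and the `write` callback are all assumptions made for illustration. With a real database the callback could be something like a pymongo collection's `insert_many`.

```python
import csv
import io
from datetime import datetime, timezone


def read_rows(text, converters):
    """CSV reader stage: yield (line number, type-converted document)."""
    reader = csv.DictReader(io.StringIO(text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        doc = {}
        for key, raw in row.items():
            try:
                doc[key] = converters.get(key, str)(raw)
            except ValueError:
                doc[key] = raw  # keep the raw string if conversion fails
        yield line_no, doc


def enrich(numbered_docs, filename):
    """Enricher stage: attach metadata (key names are hypothetical)."""
    stamp = datetime.now(timezone.utc)
    for line_no, doc in numbered_docs:
        doc["filename"] = filename
        doc["line_no"] = line_no
        doc["imported_at"] = stamp
        yield doc


def write_batches(docs, write, batch_size):
    """Writer stage: group documents into batches and pass each to write."""
    batch, total = [], 0
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            write(batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        write(batch)
        total += len(batch)
    return total


csv_text = "id,price\n1,9.5\n2,n/a\n3,4\n"
batches = []  # stands in for the database; each write appends one batch
rows = read_rows(csv_text, {"id": int, "price": float})
total = write_batches(enrich(rows, "data.csv"), batches.append, batch_size=2)
print(total, [len(b) for b in batches])
```

Because each stage is a generator, rows stream through one at a time and only the current batch is held in memory.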
Compared to mongoimport
| Feature | PyImport | mongoimport |
|---|---|---|
| Type inference | Automatic | Manual |
| Dirty data handling | Graceful fallback to string | Strict, may fail |
| Date formats | Multiple formats, automatic detection | Limited |
| Parallel processing | Built-in | Requires external scripting |
| Restart capability | Progress tracking infrastructure (in development) | Not built-in |
| CSV from URLs | Yes | No |
| File splitting | Built-in | Manual |
Installation
See the Installation Guide for setup instructions.
Next Steps
Installation - Set up PyImport
Quick Start - Get started with basic imports
Command-Line Reference - Complete CLI options documentation
Field Files - Understanding .tff field files
Advanced Usage - Parallel processing, restart, and optimization