Introduction

What is PyImport?

PyImport is a Python command-line tool for importing CSV data into MongoDB. Unlike MongoDB’s native mongoimport, PyImport focuses on:

Handling “dirty” data gracefully - Automatic type conversion with fallback to strings on errors
Automatic field type detection - Infers types from CSV data and generates field files
Multiple execution strategies - Sync, async, multi-process, and threaded imports
Parallel processing - Split large files and import in parallel for maximum throughput
Restart capability - Resume failed imports from where they left off (the progress tracking infrastructure is still in development)
Flexible type conversion - Support for dates, timestamps, ISO dates, and custom formats

Key Features

Automatic Type Detection

PyImport can analyze your CSV file and automatically generate a field file (.tff) that defines the type of each column:

pyimport --genfieldfile data.csv

This creates data.tff with inferred types for each field.
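The idea behind type inference can be illustrated with a minimal sketch. It assumes a simple rule (try int, then float, then ISO date, otherwise string) applied to the first data row. The function names and type labels are hypothetical and do not reflect PyImport's actual implementation or the .tff format.

```python
import csv
import io
from datetime import datetime


def infer_type(value: str) -> str:
    """Return a hypothetical type label for a single CSV cell."""
    for label, parse in (("int", int), ("float", float),
                         ("isodate", datetime.fromisoformat)):
        try:
            parse(value)
            return label
        except ValueError:
            continue
    return "str"  # nothing parsed, so treat the cell as a plain string


def infer_column_types(csv_text: str) -> dict[str, str]:
    """Infer one type label per column from the first data row (an assumption)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    first_row = next(reader)
    return {name: infer_type(value) for name, value in first_row.items()}


sample = "id,price,created,name\n1,9.99,2024-01-15,widget\n"
print(infer_column_types(sample))
```

A real tool would likely sample more than one row, since a single row cannot reveal mixed or missing values in a column.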

Forgiving Type Conversion

If a field value cannot be converted to its specified type, PyImport falls back to storing it as a string rather than failing the entire import. This handles messy real-world data gracefully.
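The fallback behavior described above can be sketched in a few lines. This is an illustration of the technique only; `convert_or_keep` is a hypothetical helper, not a PyImport function.

```python
def convert_or_keep(value: str, converter):
    """Apply converter; on failure return the original string unchanged."""
    try:
        return converter(value)
    except (ValueError, TypeError):
        return value


row = {"age": "42", "height": "n/a"}
doc = {
    "age": convert_or_keep(row["age"], int),         # converts cleanly
    "height": convert_or_keep(row["height"], float)  # stays a string
}
print(doc)
```

The row is still imported; only the unparseable field keeps its string form.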

Multiple Import Strategies

Synchronous (default): Single-threaded, straightforward imports
Async (--asyncpro): Event-loop-based imports using the Motor driver
Multi-process (--multi): Parallel import using multiple CPU cores
Threaded (--threads): Thread-based parallel import
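To make the thread-based strategy concrete, here is a minimal sketch using the standard library's `ThreadPoolExecutor`. The `import_chunk` function is a hypothetical stand-in for a database write, and nothing here is taken from PyImport's own code.

```python
from concurrent.futures import ThreadPoolExecutor


def import_chunk(rows: list[dict]) -> int:
    """Hypothetical stand-in for writing one chunk; returns rows handled."""
    return len(rows)


def threaded_import(chunks: list[list[dict]], workers: int = 4) -> int:
    """Import chunks concurrently and return the total row count."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(import_chunk, chunks))


# Three chunks of three rows each
chunks = [[{"n": i} for i in range(start, start + 3)] for start in (0, 3, 6)]
total = threaded_import(chunks)
print(total)
```

Threads suit I/O-bound work such as waiting on the database; CPU-bound type conversion is where a multi-process strategy tends to help more.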

Performance Optimizations

Recent performance improvements include:

Pre-compiled type converters (15-25% faster)
Optimized field validation (5-10% faster)
Fast ISO date parsing (100x faster than generic date parsing)
Expected overall improvement: 20-35%
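One way to picture pre-compiled type converters is to resolve each column's conversion function once, before any rows are read, rather than looking it up per cell. The sketch below assumes a hypothetical label-to-converter mapping and is not PyImport's implementation.

```python
from datetime import datetime

# Hypothetical mapping from type label to converter callable.
CONVERTERS = {
    "int": int,
    "float": float,
    "isodate": datetime.fromisoformat,
    "str": str,
}


def compile_converters(field_types: dict[str, str]):
    """Resolve each column's converter once, before any rows are read."""
    return [(name, CONVERTERS[label]) for name, label in field_types.items()]


def convert_row(row: dict[str, str], compiled) -> dict:
    """Apply the pre-resolved converters to one row."""
    return {name: convert(row[name]) for name, convert in compiled}


compiled = compile_converters({"id": "int", "created": "isodate"})
doc = convert_row({"id": "7", "created": "2024-01-15T10:30:00"}, compiled)
print(doc)
```

Using `datetime.fromisoformat` for ISO strings also avoids the format guessing that generic date parsers perform.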

Typical throughput:

Sync: ~24,000-32,000 docs/sec
Async: ~30,000-40,000 docs/sec
Multi-process: ~50,000+ docs/sec
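For a rough sense of what these rates mean, the following sketch estimates import time for a hypothetical 10 million document file, using the lower bound of each range above. Actual times depend on hardware, document size, and server configuration.

```python
def estimated_seconds(doc_count: int, docs_per_sec: float) -> float:
    """Estimate import duration from a document count and a throughput rate."""
    return doc_count / docs_per_sec


docs = 10_000_000  # hypothetical file size
for name, rate in (("sync", 24_000), ("async", 30_000), ("multi-process", 50_000)):
    minutes = estimated_seconds(docs, rate) / 60
    print(f"{name}: about {minutes:.1f} minutes")
```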

File Splitting

Large CSV files can be automatically split into smaller chunks for parallel processing:

pyimport --splitfile --autosplit 10 --multi --poolsize 4 largefile.csv

This splits the file into 10 chunks and processes them with 4 parallel workers.
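The splitting step can be sketched as follows. This is a simplified illustration that assumes each chunk repeats the header line and that no quoted field contains a newline; `split_csv_lines` is a hypothetical helper, and PyImport's own splitting logic may differ.

```python
def split_csv_lines(lines: list[str], chunk_count: int) -> list[list[str]]:
    """Split CSV lines into at most chunk_count chunks, repeating the header."""
    header, body = lines[0], lines[1:]
    size = max(1, -(-len(body) // chunk_count))  # ceiling division, at least 1
    return [
        [header] + body[start:start + size]
        for start in range(0, len(body), size)
    ]


lines = ["id,name", "1,a", "2,b", "3,c", "4,d", "5,e"]
for chunk in split_csv_lines(lines, 2):
    print(chunk)
```

Each chunk is a valid CSV on its own, so it can be handed to a separate worker.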

When to Use PyImport

PyImport is ideal when you need to:

Import CSV data with inconsistent types or dirty data
Automatically detect and handle various date formats
Import large files quickly using parallel processing
Resume failed imports without starting over
Add metadata (timestamps, filenames, line numbers) to imported documents
Import from URLs or local files with the same command

Quick Example

Basic import:

# Generate field file
pyimport --genfieldfile mydata.csv

# Import with generated field file
pyimport --database mydb --collection mycol mydata.csv

Fast parallel import:

pyimport --multi --splitfile --autosplit 8 --poolsize 4 \
         --database mydb --collection mycol largefile.csv

Architecture Overview

PyImport is built from the following components:

Field Files (.tff): TOML files defining column types and formats
CSV Reader: Reads and type-converts CSV data per field definitions
Enricher: Optionally adds metadata (timestamps, filenames, line numbers)
Database Writer: Batches and writes documents to MongoDB (or PostgreSQL)
Import Commands: Orchestrate the import process using different strategies
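The way the reader, enricher, and writer stages fit together can be sketched with plain generators. All names below are hypothetical, the metadata field names are assumptions, and the "sink" is an in-memory list standing in for a database.

```python
import csv
import io
from datetime import datetime, timezone


def read_rows(csv_text: str):
    """CSV Reader stage: yield (line_number, row) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        yield line_number, row


def enrich(rows, filename: str):
    """Enricher stage: attach hypothetical metadata fields to each document."""
    for line_number, row in rows:
        yield {**row, "filename": filename, "line_number": line_number,
               "imported_at": datetime.now(timezone.utc)}


def write_batches(docs, sink: list, batch_size: int = 2) -> int:
    """Writer stage: collect documents into batches and hand each to the sink."""
    batch, written = [], 0
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            sink.append(batch)
            written += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        sink.append(batch)
        written += len(batch)
    return written


batches: list = []
text = "id,name\n1,a\n2,b\n3,c\n"
count = write_batches(enrich(read_rows(text), "demo.csv"), batches)
print(count, len(batches))
```

With a real MongoDB target, the sink step would typically be a bulk write such as pymongo's `insert_many` on each batch.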

Compared to mongoimport

Feature	PyImport	mongoimport
Type inference	Automatic with `--genfieldfile`	Manual `--columnsHaveTypes`
Dirty data handling	Graceful fallback to string	Strict, may fail
Date formats	Multiple formats, automatic detection	Limited
Parallel processing	Built-in with `--multi` or `--threads`	Requires external scripting
Restart capability	Progress tracking infrastructure (in development)	Not built-in
CSV from URLs	Yes	No
File splitting	Built-in	Manual

Installation

See the Installation Guide for setup instructions.

Next Steps

Installation - Set up PyImport
Quick Start - Get started with basic imports
Command-Line Reference - Complete CLI options documentation
Field Files - Understanding .tff field files
Advanced Usage - Parallel processing, restart, and optimization