Introduction

What is PyImport?

PyImport is a Python command-line tool for importing CSV data into MongoDB. Unlike MongoDB’s native mongoimport, PyImport focuses on:

  • Handling “dirty” data gracefully - Automatic type conversion with fallback to strings on errors

  • Automatic field type detection - Infers types from CSV data and generates field files

  • Multiple execution strategies - Sync, async, multi-process, and threaded imports

  • Parallel processing - Split large files and import in parallel for maximum throughput

  • Restart capability - Resume failed imports from where they left off (the progress-tracking infrastructure is still in development)

  • Flexible type conversion - Support for dates, timestamps, ISO dates, and custom formats

Key Features

Automatic Type Detection

PyImport can analyze your CSV file and automatically generate a field file (.tff) that defines the type of each column:

pyimport --genfieldfile data.csv

This creates data.tff with inferred types for each field.
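To make the idea of type inference concrete, here is a minimal sketch of how a column's type could be guessed from its values. It is not PyImport's actual implementation: the function names and the type labels ("int", "float", "isodate", "str") are assumptions chosen for illustration.

```python
from datetime import datetime


def infer_type(values):
    """Return the narrowest illustrative type label that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "str"  # nothing to infer from, so stay with strings

    def all_match(parse):
        # True only if every non-empty value parses without error
        for v in non_empty:
            try:
                parse(v)
            except ValueError:
                return False
        return True

    if all_match(int):
        return "int"
    if all_match(float):
        return "float"
    if all_match(datetime.fromisoformat):
        return "isodate"
    return "str"


def infer_columns(rows, header):
    """Collect values per column, then infer one type label per column."""
    columns = {name: [] for name in header}
    for row in rows:
        for name, value in zip(header, row):
            columns[name].append(value.strip())
    return {name: infer_type(vals) for name, vals in columns.items()}
```

Trying the narrowest type first means a column of whole numbers is labeled as an integer rather than a float, and anything that fits no parser stays a string.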

Forgiving Type Conversion

If a field value cannot be converted to its specified type, PyImport falls back to storing it as a string rather than failing the entire import. This handles messy real-world data gracefully.
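The fallback behaviour described above can be sketched in a few lines. This is an illustration under assumptions, not PyImport's own code: the CONVERTERS table, the type labels, and the function names are hypothetical.

```python
from datetime import datetime

# Hypothetical mapping from type label to converter function
CONVERTERS = {
    "int": int,
    "float": float,
    "isodate": datetime.fromisoformat,
    "str": str,
}


def convert(value, type_name):
    """Convert value to the named type, keeping the raw string on failure."""
    try:
        return CONVERTERS[type_name](value)
    except (ValueError, TypeError):
        return value  # fall back to the original string instead of aborting


def convert_row(row, field_types):
    """Apply convert() to each field of a row given a {field: type label} map."""
    return {name: convert(row[name], type_name)
            for name, type_name in field_types.items()}
```

With this approach a row such as {"age": "n/a"} is still imported, with age stored as the string "n/a", while well-formed rows get proper numeric values.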

Multiple Import Strategies

  • Synchronous (default): Single-threaded, straightforward imports

  • Async (--asyncpro): Event-loop based async imports using the Motor driver

  • Multi-process (--multi): Parallel import using multiple CPU cores

  • Threaded (--threads): Thread-based parallel import

Performance Optimizations

Recent performance improvements include:

  • Pre-compiled type converters (15-25% faster)

  • Optimized field validation (5-10% faster)

  • Fast ISO date parsing (100x faster than generic date parsing)

  • Expected overall improvement: 20-35%
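To see why a dedicated ISO parser can beat more general date parsing, the sketch below times two standard-library approaches on the same timestamp. It is only an illustration: strptime stands in for a slower general-purpose parser, the function names are hypothetical, and the actual ratio depends on the Python version and hardware, so no figures are claimed here.

```python
import timeit
from datetime import datetime

STAMP = "2024-01-15T10:30:00"


def parse_generic(text):
    # Format-string parsing, used here as a stand-in for a general-purpose parser
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S")


def parse_iso(text):
    # Dedicated ISO 8601 parser from the standard library
    return datetime.fromisoformat(text)


def compare(repeats=100_000):
    """Return (generic_seconds, iso_seconds) for parsing STAMP repeats times."""
    generic = timeit.timeit(lambda: parse_generic(STAMP), number=repeats)
    iso = timeit.timeit(lambda: parse_iso(STAMP), number=repeats)
    return generic, iso
```

Calling compare() and dividing the first result by the second gives a rough speed ratio for your own environment.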

Typical throughput:

  • Sync: ~24,000-32,000 docs/sec

  • Async: ~30,000-40,000 docs/sec

  • Multi-process: ~50,000+ docs/sec

File Splitting

Large CSV files can be automatically split into smaller chunks for parallel processing:

pyimport --splitfile --autosplit 10 --multi --poolsize 4 largefile.csv

This splits the file into 10 chunks and processes them with 4 parallel workers.
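The split-then-parallelize pattern can be sketched as follows. This is not how PyImport splits files internally (that is not described here). It is a simplified, row-based version that loads the file into memory, and count_rows is a hypothetical stand-in for the real per-chunk import.

```python
import csv
import os
from multiprocessing import Pool


def split_csv(path, chunks, out_dir):
    """Split path into at most `chunks` files, each starting with the header row."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)  # acceptable for a sketch; a real splitter would stream
    size = max(1, -(-len(rows) // chunks))  # ceiling division: rows per chunk
    paths = []
    for start in range(0, len(rows), size):
        chunk_path = os.path.join(out_dir, f"chunk_{len(paths)}.csv")
        with open(chunk_path, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(rows[start:start + size])
        paths.append(chunk_path)
    return paths


def count_rows(path):
    """Stand-in for importing one chunk: returns the number of data rows."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.reader(f)) - 1


def process_in_parallel(paths, poolsize):
    """Hand each chunk to a pool of worker processes, poolsize at a time."""
    with Pool(poolsize) as pool:
        return pool.map(count_rows, paths)
```

Having more chunks than workers, as in the command above, lets a worker that finishes early pick up another chunk instead of sitting idle.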

When to Use PyImport

PyImport is ideal when you need to:

  • Import CSV data with inconsistent types or dirty data

  • Automatically detect and handle various date formats

  • Import large files quickly using parallel processing

  • Resume failed imports without starting over

  • Add metadata (timestamps, filenames, line numbers) to imported documents

  • Import from URLs or local files with the same command

Quick Example

Basic import:

# Generate field file
pyimport --genfieldfile mydata.csv

# Import with generated field file
pyimport --database mydb --collection mycol mydata.csv

Fast parallel import:

pyimport --multi --splitfile --autosplit 8 --poolsize 4 \
         --database mydb --collection mycol largefile.csv

Architecture Overview

PyImport is organized as a pipeline of five components:

  1. Field Files (.tff): TOML files defining column types and formats

  2. CSV Reader: Reads and type-converts CSV data per field definitions

  3. Enricher: Optionally adds metadata (timestamps, filenames, line numbers)

  4. Database Writer: Batches and writes documents to MongoDB (or PostgreSQL)

  5. Import Commands: Orchestrate the import process using different strategies
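The flow through these components can be sketched as a chain of generators. None of the names below come from PyImport: read_rows, enrich, write_batches, and the "_meta" field are hypothetical. The writer takes any callable, so with pymongo it could be passed a collection's insert_many method.

```python
import csv
import io
from datetime import datetime, timezone


def read_rows(stream, field_types):
    """CSV Reader: yield (line_number, document) with per-field type conversion."""
    converters = {"int": int, "float": float, "str": str}
    # The header is line 1, so data starts at line 2 (assumes no multi-line fields)
    for line_no, row in enumerate(csv.DictReader(stream), start=2):
        doc = {}
        for name, type_name in field_types.items():
            try:
                doc[name] = converters[type_name](row[name])
            except ValueError:
                doc[name] = row[name]  # fall back to the raw string
        yield line_no, doc


def enrich(numbered_docs, filename):
    """Enricher: attach filename, line number, and import timestamp."""
    stamp = datetime.now(timezone.utc)
    for line_no, doc in numbered_docs:
        doc["_meta"] = {"filename": filename, "line": line_no, "imported_at": stamp}
        yield doc


def write_batches(docs, insert_many, batch_size=500):
    """Database Writer: group documents into batches and hand each to insert_many."""
    batch, total = [], 0
    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            insert_many(batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        insert_many(batch)
        total += len(batch)
    return total
```

For example, write_batches(enrich(read_rows(io.StringIO(text), types), "people.csv"), stored.append) collects batches into a plain list named stored, which is handy for trying the pipeline without a database.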

Compared to mongoimport

| Feature             | PyImport                                          | mongoimport                 |
|---------------------|---------------------------------------------------|-----------------------------|
| Type inference      | Automatic with --genfieldfile                     | Manual --columnsHaveTypes   |
| Dirty data handling | Graceful fallback to string                       | Strict, may fail            |
| Date formats        | Multiple formats, automatic detection             | Limited                     |
| Parallel processing | Built-in with --multi or --threads                | Requires external scripting |
| Restart capability  | Progress tracking infrastructure (in development) | Not built-in                |
| CSV from URLs       | Yes                                               | No                          |
| File splitting      | Built-in                                          | Manual                      |

Installation

See the Installation Guide for setup instructions.

Next Steps