# Introduction

## What is PyImport?

PyImport is a Python command-line tool for importing CSV data into MongoDB. Unlike MongoDB's native `mongoimport`, PyImport focuses on:

- **Handling "dirty" data gracefully** - Automatic type conversion with fallback to strings on errors
- **Automatic field type detection** - Infers types from CSV data and generates field files
- **Multiple execution strategies** - Sync, async, multi-process, and threaded imports
- **Parallel processing** - Split large files and import in parallel for maximum throughput
- **Restart capability** - Resume failed imports from where they left off (the progress-tracking infrastructure is still in development)
- **Flexible type conversion** - Support for dates, timestamps, ISO dates, and custom formats

## Key Features

### Automatic Type Detection
PyImport can analyze your CSV file and automatically generate a field file (`.tff`) that defines the type of each column:

```bash
pyimport --genfieldfile data.csv
```

This creates `data.tff` with inferred types for each field.
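
To give a feel for what type inference involves, here is a minimal sketch in Python. It is not PyImport's actual implementation: the function name `infer_type` and the type labels (`int`, `float`, `datetime`, `str`) are assumptions made for illustration, and the real `.tff` type names may differ.

```python
from datetime import datetime


def infer_type(values):
    """Guess a column type from raw CSV strings (illustrative only)."""
    # Candidate parsers, tried from most to least specific.
    candidates = [
        ("int", int),
        ("float", float),
        ("datetime", datetime.fromisoformat),
    ]
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "str"
    for name, parse in candidates:
        try:
            for v in non_empty:
                parse(v)
        except ValueError:
            continue  # at least one value failed; try the next candidate
        return name  # every non-empty value parsed cleanly
    return "str"


print(infer_type(["1", "2", "3"]))
print(infer_type(["1.5", "2"]))
print(infer_type(["2024-01-02", "2024-03-04"]))
print(infer_type(["abc", "1"]))
```

A column is only given a narrower type when every non-empty value parses; anything else is treated as a string.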

### Forgiving Type Conversion
If a field value cannot be converted to its specified type, PyImport falls back to storing it as a string rather than failing the entire import. This handles messy real-world data gracefully.
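
The fallback behaviour can be pictured with a short Python sketch. The helper `convert_or_keep` and the sample row are hypothetical and only illustrate the idea; PyImport's internal converters are not shown here.

```python
def convert_or_keep(raw, converter):
    """Apply converter to raw; keep the original string if conversion fails."""
    try:
        return converter(raw)
    except (ValueError, TypeError):
        return raw


# Hypothetical row and per-field converters.
row = {"id": "42", "price": "n/a"}
converters = {"id": int, "price": float}

doc = {key: convert_or_keep(value, converters[key]) for key, value in row.items()}
print(doc)
```

Here `id` becomes an integer, while `price` stays as the string `"n/a"` because it cannot be parsed as a float.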

### Multiple Import Strategies

- **Synchronous** (default): Single-threaded, straightforward imports
- **Async** (`--asyncpro`): Event-loop based async imports using the Motor driver
- **Multi-process** (`--multi`): Parallel import using multiple CPU cores
- **Threaded** (`--threads`): Thread-based parallel import

### Performance Optimizations

Recent performance improvements include:
- Pre-compiled type converters (15-25% faster)
- Optimized field validation (5-10% faster)
- Fast ISO date parsing (100x faster than generic date parsing)

Together, these changes are expected to improve overall import speed by 20-35%.
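
One common way to get a fast path for ISO dates is to try a dedicated ISO parser first and only fall back to slower format matching when that fails. The sketch below uses the standard library's `datetime.fromisoformat`. The `parse_date` function and the `FALLBACK_FORMATS` tuple are assumptions for illustration, not PyImport's code, and the sketch makes no claim about the speedup figure quoted above.

```python
from datetime import datetime

# Hypothetical fallback formats, tried only when the ISO fast path fails.
FALLBACK_FORMATS = ("%d/%m/%Y", "%Y/%m/%d")


def parse_date(text):
    """Parse ISO 8601 text directly; otherwise try each fallback format."""
    try:
        return datetime.fromisoformat(text)  # fast path
    except ValueError:
        pass
    for fmt in FALLBACK_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


print(parse_date("2024-01-02T03:04:05"))
print(parse_date("02/01/2024"))
```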

Typical throughput:
- **Sync**: ~24,000-32,000 docs/sec
- **Async**: ~30,000-40,000 docs/sec
- **Multi-process**: ~50,000+ docs/sec

### File Splitting
Large CSV files can be automatically split into smaller chunks for parallel processing:

```bash
pyimport --splitfile --autosplit 10 --multi --poolsize 4 largefile.csv
```

This splits the file into 10 chunks and processes them with 4 parallel workers.
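
The relationship between chunk count and worker count can be sketched in Python as follows. This is an illustration under stated assumptions, not PyImport's splitter: `split_rows` and `import_chunk` are hypothetical names, rows are held in memory, and a thread pool stands in for the worker pool to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(rows, n_chunks):
    """Split rows into n_chunks contiguous slices whose sizes differ by at most one."""
    base, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(rows[start:start + size])
        start += size
    return chunks


def import_chunk(chunk):
    """Stand-in for importing one chunk; returns the number of rows handled."""
    return len(chunk)


rows = [f"row-{i}" for i in range(103)]
chunks = split_rows(rows, 10)  # 10 chunks, as with --autosplit 10

# 4 workers pull chunks from the queue until all 10 are done.
with ThreadPoolExecutor(max_workers=4) as pool:
    imported = sum(pool.map(import_chunk, chunks))

print(len(chunks), imported)
```

Using more chunks than workers lets a worker that finishes early pick up another chunk instead of sitting idle.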

## When to Use PyImport

PyImport is ideal when you need to:

- Import CSV data with inconsistent types or dirty data
- Automatically detect and handle various date formats
- Import large files quickly using parallel processing
- Resume failed imports without starting over
- Add metadata (timestamps, filenames, line numbers) to imported documents
- Import from URLs or local files with the same command

## Quick Example

Basic import:
```bash
# Generate field file
pyimport --genfieldfile mydata.csv

# Import with generated field file
pyimport --database mydb --collection mycol mydata.csv
```

Fast parallel import:
```bash
pyimport --multi --splitfile --autosplit 8 --poolsize 4 \
         --database mydb --collection mycol largefile.csv
```

## Architecture Overview

PyImport is organized into five components:

1. **Field Files** (`.tff`): TOML files defining column types and formats
2. **CSV Reader**: Reads and type-converts CSV data per field definitions
3. **Enricher**: Optionally adds metadata (timestamps, filenames, line numbers)
4. **Database Writer**: Batches and writes documents to MongoDB (or PostgreSQL)
5. **Import Commands**: Orchestrate the import process using different strategies
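
The flow through these components can be sketched as a chain of Python generators. Everything in this example is an assumption made for illustration: the function names, the `_file` and `_line` metadata keys, and the in-memory `written` list that stands in for the database writer. A real writer would hand each batch to the database driver, for example via pymongo's `insert_many`.

```python
import csv
import io
from itertools import islice


def read_rows(text, converters):
    """CSV reader: yield one dict per row, converting fields where possible."""
    for row in csv.DictReader(io.StringIO(text)):
        doc = {}
        for key, raw in row.items():
            try:
                doc[key] = converters.get(key, str)(raw)
            except ValueError:
                doc[key] = raw  # forgiving fallback to the raw string
        yield doc


def enrich(docs, filename):
    """Enricher: add the source filename and line number to each document."""
    for line_no, doc in enumerate(docs, start=2):  # line 1 is the header
        yield {**doc, "_file": filename, "_line": line_no}


def batches(docs, size):
    """Group documents into lists of at most `size` items."""
    it = iter(docs)
    while batch := list(islice(it, size)):
        yield batch


csv_text = "id,name\n1,ada\n2,grace\nx,linus\n"
written = []  # stand-in for the database writer
for batch in batches(enrich(read_rows(csv_text, {"id": int}), "people.csv"), size=2):
    written.append(batch)

print(written)
```

Because each stage is a generator, rows stream through the chain one batch at a time rather than being loaded into memory all at once.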

## Compared to mongoimport

| Feature | PyImport | mongoimport |
|---------|----------|-------------|
| Type inference | Automatic with `--genfieldfile` | Manual `--columnsHaveTypes` |
| Dirty data handling | Graceful fallback to string | Strict, may fail |
| Date formats | Multiple formats, automatic detection | Limited |
| Parallel processing | Built-in with `--multi` or `--threads` | Requires external scripting |
| Restart capability | Progress tracking infrastructure (in development) | Not built-in |
| CSV from URLs | Yes | No |
| File splitting | Built-in | Manual |

## Installation

See the [Installation Guide](installation.md) for setup instructions.

## Next Steps

- [Installation](installation.md) - Set up PyImport
- [Quick Start](quickstart.md) - Get started with basic imports
- [Command-Line Reference](cli_reference.md) - Complete CLI options documentation
- [Field Files](fieldfiles.md) - Understanding `.tff` field files
- [Advanced Usage](advanced.md) - Parallel processing, restart, and optimization