Quick Start Guide

Get started with PyImport in minutes.

Your First Import

Step 1: Create a CSV File

Create a simple test file:

cat > people.csv <<EOF
name,age,city,salary,join_date
Alice,30,New York,75000,2020-01-15
Bob,25,Los Angeles,65000,2021-03-22
Charlie,35,Chicago,85000,2019-07-10
Diana,28,Houston,70000,2022-05-18
EOF

Step 2: Generate Field File

Let PyImport analyze your CSV and create a field file:

pyimport --genfieldfile people.csv

This creates people.tff with inferred types:

[name]
type = "str"

[age]
type = "int"

[city]
type = "str"

[salary]
type = "int"

[join_date]
type = "date"
format = "%Y-%m-%d"
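To get a feel for what "inferred types" means, here is a minimal Python sketch. It is not PyImport's actual implementation: the helper name `infer_type` and the try-in-order strategy (int, then an ISO-style date, then str) are assumptions made for illustration.

```python
import csv
import io
from datetime import datetime

SAMPLE = """name,age,city,salary,join_date
Alice,30,New York,75000,2020-01-15
Bob,25,Los Angeles,65000,2021-03-22
"""

def infer_type(values):
    """Return 'int', 'date', or 'str' for a column of strings (hypothetical helper)."""
    def all_parse(parser):
        # True only if every value in the column parses without error.
        try:
            for v in values:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "int"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "str"

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
inferred = {col: infer_type([r[col] for r in rows]) for col in rows[0]}
print(inferred)
```

Trying the strictest type first and falling back to `str` means a single non-numeric value turns the whole column into a string, which is why it is worth reviewing the generated field file before importing.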

Step 3: Import to MongoDB

pyimport --database mydb --collection employees people.csv

That’s it! Your data is now in MongoDB.

Step 4: Verify Import

Check your data using mongosh or your preferred MongoDB client:

mongosh mydb --eval "db.employees.find().pretty()"

Output:

{
  "_id": ObjectId("..."),
  "name": "Alice",
  "age": 30,
  "city": "New York",
  "salary": 75000,
  "join_date": ISODate("2020-01-15T00:00:00Z")
}
...
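The numbers and the date in that output are typed values, not strings, because each CSV field passes through the converter named in the field file. A rough Python sketch of that step follows; the `CONVERTERS` mapping and `to_document` helper are illustrative assumptions, not PyImport's code.

```python
from datetime import datetime

# Converters keyed by the type names used in people.tff (mapping is illustrative).
CONVERTERS = {
    "str": str,
    "int": int,
    "date": lambda v: datetime.strptime(v, "%Y-%m-%d"),
}

# Field name -> type name, mirroring the generated field file.
FIELD_TYPES = {"name": "str", "age": "int", "city": "str",
               "salary": "int", "join_date": "date"}

def to_document(row):
    """Apply the per-field converter to each raw CSV string."""
    return {field: CONVERTERS[FIELD_TYPES[field]](value)
            for field, value in row.items()}

doc = to_document({"name": "Alice", "age": "30", "city": "New York",
                   "salary": "75000", "join_date": "2020-01-15"})
print(doc)
```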

Common Scenarios

Importing Pipe-Delimited Files

# Your data
cat > data.txt <<EOF
name|age|department
Alice|30|Engineering
Bob|25|Sales
EOF

# Generate field file with correct delimiter
pyimport --genfieldfile --delimiter "|" data.txt

# Import
pyimport --delimiter "|" --database mydb --collection staff data.txt
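The delimiter has to be given to both commands because a CSV parser cannot tell where fields end without it. This short Python sketch uses the standard `csv` module (not PyImport itself) to show the difference between the right and the wrong delimiter on the same data.

```python
import csv
import io

DATA = "name|age|department\nAlice|30|Engineering\nBob|25|Sales\n"

# Correct delimiter: three fields per row.
records = list(csv.DictReader(io.StringIO(DATA), delimiter="|"))

# Default delimiter (comma): each line collapses into a single field.
wrong = list(csv.reader(io.StringIO(DATA)))

print(records[0])
print(wrong[0])
```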

Importing Tab-Delimited Files

pyimport --delimiter tab --database mydb --collection data data.tsv

Importing Files Without Headers

If your CSV doesn’t have a header row, create the field file manually:

# Your data (no header)
cat > data.csv <<EOF
Alice,30,NYC
Bob,25,LA
EOF

# Create field file manually
cat > data.tff <<EOF
[name]
type = "str"

[age]
type = "int"

[city]
type = "str"
EOF

# Import (don't use --hasheader)
pyimport --database mydb --collection people data.csv
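Without a header row, the field names have to come from somewhere else, and the order of the sections in the field file is what ties names to columns. The sketch below shows the same idea with Python's `csv.DictReader`; treating the field-file order as the column order is an assumption about how PyImport behaves.

```python
import csv
import io

DATA = "Alice,30,NYC\nBob,25,LA\n"
FIELD_NAMES = ["name", "age", "city"]  # same order as the sections in data.tff

# Passing fieldnames tells the reader that the first line is data, not a header.
reader = csv.DictReader(io.StringIO(DATA), fieldnames=FIELD_NAMES)
records = list(reader)
print(records)
```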

Adding Metadata to Imports

Track where your data came from:

pyimport --addfilename --addtimestamp doc \
         --addfield source=ETL --addfield batch=morning \
         --database mydb --collection data data.csv

Result:

{
  "name": "Alice",
  "age": 30,
  "filename": "data.csv",
  "timestamp": ISODate("2024-01-15T10:30:00Z"),
  "source": "ETL",
  "batch": "morning"
}
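Conceptually, these options merge a few provenance fields into every document before it is written. A minimal Python sketch, assuming a hypothetical `add_metadata` helper (the key names `filename` and `timestamp` are taken from the result above):

```python
from datetime import datetime, timezone

def add_metadata(doc, filename, extra_fields, now=None):
    """Return a copy of doc with provenance fields added (hypothetical helper)."""
    stamped = dict(doc)
    stamped["filename"] = filename
    stamped["timestamp"] = now or datetime.now(timezone.utc)
    stamped.update(extra_fields)  # e.g. the key=value pairs from --addfield
    return stamped

doc = add_metadata(
    {"name": "Alice", "age": 30},
    filename="data.csv",
    extra_fields={"source": "ETL", "batch": "morning"},
    now=datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc),
)
print(doc)
```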

Importing Large Files Fast

For files with millions of rows, use parallel processing:

pyimport --multi --splitfile --autosplit 8 --poolsize 4 \
         --batchsize 5000 \
         --database mydb --collection bigdata \
         large_file.csv

This will:

Split the file into 8 chunks
Process with 4 parallel workers
Insert in batches of 5000 documents
Clean up split files automatically
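The splitting and batching steps above can be sketched in a few lines of Python. This is only an illustration: `chunk_bounds` and `batches` are hypothetical helpers, and splitting by row count is an assumption (a real splitter may work on byte offsets instead).

```python
from itertools import islice

def chunk_bounds(total_rows, chunks):
    """Split total_rows into `chunks` contiguous (start, end) ranges of near-equal size."""
    base, extra = divmod(total_rows, chunks)
    bounds, start = [], 0
    for i in range(chunks):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

def batches(iterable, size):
    """Yield lists of up to `size` items until the iterable is exhausted."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

bounds = chunk_bounds(1_000_003, 8)
sizes = [len(b) for b in batches(range(12_000), 5000)]
print(bounds[0], bounds[-1])
print(sizes)  # [5000, 5000, 2000]
```

Using more chunks than workers (8 chunks for 4 workers here) lets a worker that finishes early pick up another chunk instead of sitting idle.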

Importing Multiple Files

Import several files with the same schema:

# All files share the same field file
pyimport --fieldfile schema.tff \
         --database mydb --collection combined \
         jan.csv feb.csv mar.csv

Or use a file list:

# Create file list
ls *.csv > files.txt

# Import all
pyimport --filelist files.txt --database mydb --collection all_data

Importing from URLs

PyImport can fetch CSV files from URLs:

pyimport --database mydb --collection web_data \
         https://example.com/data.csv

Testing Before Full Import

Import just the first 100 rows to verify everything works:

pyimport --limit 100 --loglevel DEBUG \
         --database test --collection sample \
         data.csv

Replacing Existing Data

Drop and recreate the collection:

pyimport --drop --database mydb --collection users users.csv

Warning: This deletes all existing data in the collection!

Working with Dates

ISO Dates (Fast)

If your dates are in ISO format (YYYY-MM-DD), PyImport will detect them automatically:

name,join_date
Alice,2020-01-15
Bob,2021-03-22

Field file:

[join_date]
type = "isodate"

The isodate type can be up to 100x faster than generic date parsing, so prefer it whenever your dates are in ISO format.
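One plausible reason for the difference is that a fixed ISO layout can be read positionally, while a generic parser has to interpret a format string first. The Python sketch below contrasts the two styles using the standard library. It illustrates the idea only and makes no claim about PyImport's internals or about actual timings.

```python
from datetime import datetime

value = "2020-01-15"

# Fixed-layout parser: only accepts ISO 8601 input.
fixed = datetime.fromisoformat(value)

# Format-driven parser: interprets the format string on every call.
generic = datetime.strptime(value, "%Y-%m-%d")

print(fixed == generic)
```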

Custom Date Formats

For non-ISO dates, PyImport can infer the format, or you can specify it yourself:

name,join_date
Alice,01/15/2020
Bob,03/22/2021

Auto-detected field file:

[join_date]
type = "date"
format = "%m/%d/%Y"
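Format detection of this kind can be approximated by trying a list of candidate formats until one parses every value. The sketch below is an assumption about the general technique, not PyImport's detection logic, and the candidate list is deliberately short.

```python
from datetime import datetime

CANDIDATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y"]  # illustrative list

def guess_format(values):
    """Return the first candidate format that parses every value, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            for v in values:
                datetime.strptime(v, fmt)
        except ValueError:
            continue  # this format failed on some value; try the next one
        return fmt
    return None

detected = guess_format(["01/15/2020", "03/22/2021"])
print(detected)  # %m/%d/%Y
```

Checking every value matters: a single value such as 01/02/2020 fits both month-first and day-first formats, so only a value with a day above 12 settles the question.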

Timestamps

Unix timestamps:

name,event_time
Alice,1678901234
Bob,1678902345

Field file:

[event_time]
type = "timestamp"

Handling Errors

Skip Bad Rows

By default, PyImport warns about errors but continues:

pyimport --onerror Warn data.csv

Stop on First Error

For strict validation:

pyimport --onerror Fail data.csv
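The difference between the two policies can be sketched in Python. The `convert_rows` function and its `on_error` parameter are hypothetical; in particular, whether PyImport drops a bad row entirely in Warn mode is an assumption based on the "Skip Bad Rows" heading.

```python
def convert_rows(rows, on_error="warn"):
    """Convert the 'age' column to int; on_error is 'warn' or 'fail' (illustrative)."""
    good, warnings = [], []
    for line_no, row in enumerate(rows, start=1):
        try:
            good.append({"name": row["name"], "age": int(row["age"])})
        except ValueError as exc:
            if on_error == "fail":
                # Strict mode: abort the whole import at the first bad row.
                raise ValueError(f"row {line_no}: {exc}") from exc
            # Lenient mode: record the problem and move on.
            warnings.append(f"row {line_no} skipped: {exc}")
    return good, warnings

rows = [{"name": "Alice", "age": "30"}, {"name": "Bob", "age": "n/a"}]
good, warnings = convert_rows(rows, on_error="warn")
print(good)
print(warnings)
```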

Debug Import Issues

Use verbose logging to see what’s happening:

pyimport --loglevel DEBUG --verbose data.csv

Performance Comparison

Import of 200,000 rows (NYC taxi data):

Method	Time	Docs/sec
Sync (default)	~8.3s	~24,000
Async	~6.6s	~30,000
Multi-process (4 cores)	~4.0s	~50,000

Commands:

# Default sync
pyimport data.csv

# Async
pyimport --asyncpro data.csv

# Multi-process
pyimport --multi --splitfile --autosplit 8 --poolsize 4 data.csv
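The Docs/sec column in the table is simply the row count divided by the elapsed time. This short Python check reproduces the figures and adds the speedup relative to the default sync mode, using only the timings quoted above.

```python
ROWS = 200_000
timings = {"sync": 8.3, "async": 6.6, "multi-process": 4.0}  # seconds, from the table

rates = {method: ROWS / seconds for method, seconds in timings.items()}
speedups = {method: timings["sync"] / seconds for method, seconds in timings.items()}

for method in timings:
    print(f"{method:14s} {rates[method]:>8,.0f} docs/sec  {speedups[method]:.2f}x vs sync")
```

On these numbers, multi-process mode with 4 cores is roughly twice as fast as sync, not four times, so expect less than linear scaling as you add workers.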

Configuration File

Create ~/.pyimport.conf to avoid repeating options:

# MongoDB connection
mdburi = mongodb://localhost:27017
database = mydb

# Import settings
batchsize = 5000
hasheader = True
addfilename = True
addtimestamp = doc

# Parallel processing
poolsize = 4

Then simply:

pyimport --collection users users.csv

Next Steps

Now that you know the basics:

Command-Line Reference - Complete list of all options
Field Files - Deep dive into type conversion
Advanced Usage - Optimization, troubleshooting, and advanced features

Common Issues

“Connection refused”

MongoDB isn’t running. Start it:

# macOS
brew services start mongodb-community

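# Linux (systemd). The service is named mongod with the official MongoDB
# packages; some distribution packages name it mongodb instead.
sudo systemctl start mongod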

# Docker
docker run -d -p 27017:27017 mongo

“Field count mismatch”

Your CSV has inconsistent column counts. Check for:

Missing commas
Unquoted commas inside field values
Wrong delimiter setting

Use --loglevel DEBUG to see which row is causing issues.
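If you would rather find the offending rows before importing, a few lines of Python with the standard `csv` module will do it. This is a standalone check, separate from PyImport; the function name `find_mismatched_rows` is made up for this example.

```python
import csv
import io

# Sample with one short row (line 3) and one row with an unquoted comma (line 4).
DATA = "name,age,city\nAlice,30,NYC\nBob,25\nCarol,41,Austin, TX\n"

def find_mismatched_rows(lines, delimiter=","):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(lines, delimiter=delimiter)
    expected = len(next(reader))  # the header row defines the expected width
    return [(reader.line_num, len(row)) for row in reader if len(row) != expected]

mismatches = find_mismatched_rows(io.StringIO(DATA))
print(mismatches)  # [(3, 2), (4, 4)]
```

To check a real file, pass an open file object instead of the `io.StringIO` sample, for example `open("data.csv", newline="")`.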

“No field file found”

PyImport couldn’t find a .tff file. Either:

Generate one: pyimport --genfieldfile data.csv
Specify explicitly: pyimport --fieldfile myfields.tff data.csv

Dates Not Parsing

If dates aren’t converting properly:

Check the format in your field file
Use ISO format (YYYY-MM-DD) when possible

Specify format explicitly:

[date]
type = "date"
format = "%d/%m/%Y"  # DD/MM/YYYY
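Getting the day and month order wrong does not always produce an error, which makes this mistake easy to miss. The Python example below parses the same string with both formats and gets two different, equally valid dates.

```python
from datetime import datetime

value = "03/04/2021"

as_dmy = datetime.strptime(value, "%d/%m/%Y")  # day first: 3 April 2021
as_mdy = datetime.strptime(value, "%m/%d/%Y")  # month first: 4 March 2021

print(as_dmy.date(), as_mdy.date())
```

Only values with a day greater than 12 fail under the wrong format, so spot-check a few imported dates rather than relying on the absence of errors.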

Import Is Slow

Try these optimizations:

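Use --multi --splitfile to process the file in parallel
Increase --batchsize to 5000-10000
Use --writeconcern 0 for the fastest writes (writes are unacknowledged, so most write errors go unreported)
Create indexes after the import rather than before it, so inserts do not pay for index maintenance
Use SSD storage for MongoDB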