Quick Start Guide¶
Get started with PyImport in minutes.
Your First Import¶
Step 1: Create a CSV File¶
Create a simple test file:
cat > people.csv <<EOF
name,age,city,salary,join_date
Alice,30,New York,75000,2020-01-15
Bob,25,Los Angeles,65000,2021-03-22
Charlie,35,Chicago,85000,2019-07-10
Diana,28,Houston,70000,2022-05-18
EOF
Step 2: Generate Field File¶
Let PyImport analyze your CSV and create a field file:
pyimport --genfieldfile people.csv
This creates people.tff with inferred types:
[name]
type = "str"
[age]
type = "int"
[city]
type = "str"
[salary]
type = "int"
[join_date]
type = "date"
format = "%Y-%m-%d"
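To make the inference step concrete, here is a minimal Python sketch of how column types could be guessed from sample rows. It is an illustration only, not PyImport's actual algorithm: the `infer_type` helper, the order of the checks, and the `float` fallback are all assumptions.

```python
import csv
import io
from datetime import datetime


def infer_type(values):
    """Guess a field-file style type name for a column of strings (hypothetical helper)."""

    def all_parse(parser):
        # True only if every value in the column parses without error.
        try:
            for value in values:
                parser(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "int"
    if all_parse(float):
        return "float"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "str"


SAMPLE = """name,age,city,salary,join_date
Alice,30,New York,75000,2020-01-15
Bob,25,Los Angeles,65000,2021-03-22
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
inferred = {col: infer_type([row[col] for row in rows]) for col in rows[0]}
print(inferred)
```

For the sample data this yields the same type names as the generated field file above, which is a quick way to sanity-check a field file before importing.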
Step 3: Import to MongoDB¶
pyimport --database mydb --collection employees people.csv
That’s it! Your data is now in MongoDB.
Step 4: Verify Import¶
Check your data using mongosh or your preferred MongoDB client:
mongosh mydb --eval "db.employees.find().pretty()"
Output:
{
"_id": ObjectId("..."),
"name": "Alice",
"age": 30,
"city": "New York",
"salary": 75000,
"join_date": ISODate("2020-01-15T00:00:00Z")
}
...
Common Scenarios¶
Importing Pipe-Delimited Files¶
# Your data
cat > data.txt <<EOF
name|age|department
Alice|30|Engineering
Bob|25|Sales
EOF
# Generate field file with correct delimiter
pyimport --genfieldfile --delimiter "|" data.txt
# Import
pyimport --delimiter "|" --database mydb --collection staff data.txt
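The delimiter matters because a wrong one collapses every row into a single column. The sketch below uses Python's standard `csv` module (not PyImport itself) to show the effect; `read_delimited` is a hypothetical helper.

```python
import csv
import io

PIPE_DATA = "name|age|department\nAlice|30|Engineering\nBob|25|Sales\n"


def read_delimited(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


pipe_rows = read_delimited(PIPE_DATA, "|")   # correct delimiter: three fields
wrong_rows = read_delimited(PIPE_DATA, ",")  # wrong delimiter: one fused field

print(pipe_rows[0])
print(list(wrong_rows[0].keys()))
```

With the comma delimiter the header is read as one field named `name|age|department`, which is the typical symptom of a mismatched delimiter setting.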
Importing Tab-Delimited Files¶
pyimport --delimiter tab --database mydb --collection data data.tsv
Importing Files Without Headers¶
If your CSV doesn’t have a header row, create the field file manually:
# Your data (no header)
cat > data.csv <<EOF
Alice,30,NYC
Bob,25,LA
EOF
# Create field file manually
cat > data.tff <<EOF
[name]
type = "str"
[age]
type = "int"
[city]
type = "str"
EOF
# Import (omit --hasheader because the file has no header row)
pyimport --database mydb --collection people data.csv
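Without a header row, field names come from the field file and are matched to columns by position. A minimal Python sketch of that idea follows; the `FIELDS` list and `rows_to_docs` helper are hypothetical and only mirror the manual field file above.

```python
import csv
import io

# Order matters: entry N describes column N of the headerless file.
FIELDS = [("name", str), ("age", int), ("city", str)]
DATA = "Alice,30,NYC\nBob,25,LA\n"


def rows_to_docs(text, fields):
    """Pair each column with its field name and convert it to the declared type."""
    docs = []
    for row in csv.reader(io.StringIO(text)):
        docs.append({name: conv(value) for (name, conv), value in zip(fields, row)})
    return docs


headerless_docs = rows_to_docs(DATA, FIELDS)
print(headerless_docs)
```

If the sections in the field file are listed in a different order than the columns, values end up under the wrong names, so it is worth checking the order first.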
Adding Metadata to Imports¶
Track where your data came from:
pyimport --addfilename --addtimestamp doc \
--addfield source=ETL --addfield batch=morning \
--database mydb --collection data data.csv
Result:
{
"name": "Alice",
"age": 30,
"filename": "data.csv",
"timestamp": ISODate("2024-01-15T10:30:00Z"),
"source": "ETL",
"batch": "morning"
}
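The result above can be pictured as each parsed row being merged with a few extra keys. The following Python sketch shows that merge under stated assumptions: `add_metadata` is a hypothetical function, and the way it splits `key=value` pairs is a guess at the behavior, not PyImport's code.

```python
from datetime import datetime, timezone


def add_metadata(doc, filename, extra_fields, now=None):
    """Return a copy of doc with filename, timestamp, and key=value extras added."""
    stamped = dict(doc)
    stamped["filename"] = filename
    stamped["timestamp"] = now or datetime.now(timezone.utc)
    for pair in extra_fields:
        key, _, value = pair.partition("=")  # "source=ETL" -> ("source", "ETL")
        stamped[key] = value
    return stamped


fixed_time = datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc)
stamped_doc = add_metadata(
    {"name": "Alice", "age": 30},
    "data.csv",
    ["source=ETL", "batch=morning"],
    now=fixed_time,
)
print(stamped_doc)
```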
Importing Large Files Fast¶
For files with millions of rows, use parallel processing:
pyimport --multi --splitfile --autosplit 8 --poolsize 4 \
--batchsize 5000 \
--database mydb --collection bigdata \
large_file.csv
This will:
Split the file into 8 chunks
Process with 4 parallel workers
Insert in batches of 5000 documents
Clean up split files automatically
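The steps above combine two ideas: splitting the input into chunks and inserting rows in fixed-size batches. The Python sketch below illustrates both on an in-memory list. The helpers `split_into_chunks` and `batches` are hypothetical; PyImport's real splitting works on files and may divide them differently.

```python
from itertools import islice


def split_into_chunks(lines, n_chunks):
    """Split lines into n_chunks contiguous lists of near-equal size."""
    size, extra = divmod(len(lines), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)  # first `extra` chunks get one more
        chunks.append(lines[start:end])
        start = end
    return chunks


def batches(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


lines = [f"row{i}" for i in range(20)]
chunk_sizes = [len(c) for c in split_into_chunks(lines, 8)]
batch_sizes = [len(b) for b in batches(range(12), 5)]
print(chunk_sizes, batch_sizes)
```

Using more chunks than workers, as in the command above with 8 chunks and 4 workers, lets a worker that finishes early pick up another chunk.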
Importing Multiple Files¶
Import several files with the same schema:
# All files share the same field file
pyimport --fieldfile schema.tff \
--database mydb --collection combined \
jan.csv feb.csv mar.csv
Or use a file list:
# Create file list
ls *.csv > files.txt
# Import all
pyimport --filelist files.txt --database mydb --collection all_data
Importing from URLs¶
PyImport can fetch CSV files from URLs:
pyimport --database mydb --collection web_data \
https://example.com/data.csv
Testing Before Full Import¶
Import just the first 100 rows to verify everything works:
pyimport --limit 100 --loglevel DEBUG \
--database test --collection sample \
data.csv
Replacing Existing Data¶
Drop and recreate the collection:
pyimport --drop --database mydb --collection users users.csv
Warning: This deletes all existing data in the collection!
Working with Dates¶
ISO Dates (Fast)¶
If your dates are in ISO format (YYYY-MM-DD), PyImport will detect them automatically:
name,join_date
Alice,2020-01-15
Bob,2021-03-22
Field file:
[join_date]
type = "isodate"
Parsing with the isodate type is about 100x faster than generic date parsing.
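As an illustration of the difference between a fixed-format ISO parser and a generic format-driven one, the Python standard library offers both. Whether PyImport uses these exact calls is an assumption; the sketch only shows that the two approaches agree on the result.

```python
from datetime import datetime

iso_value = "2020-01-15"

# Fixed-format parser: accepts only ISO 8601 style input.
fast = datetime.fromisoformat(iso_value)

# Generic parser: interprets the string according to a format pattern.
generic = datetime.strptime(iso_value, "%Y-%m-%d")

print(fast, generic, fast == generic)
```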
Custom Date Formats¶
For non-ISO dates, PyImport can infer the format or you can specify it:
name,join_date
Alice,01/15/2020
Bob,03/22/2021
Auto-detected field file:
[join_date]
type = "date"
format = "%m/%d/%Y"
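The format string uses the same directives as Python's `strptime`. The sketch below shows why the format matters for slash-separated dates: the same text can mean two different days. The `parse_date` wrapper is hypothetical.

```python
from datetime import datetime


def parse_date(value, fmt):
    """Parse value using a strptime-style format string."""
    return datetime.strptime(value, fmt)


us_date = parse_date("01/15/2020", "%m/%d/%Y")       # 15 January 2020
month_first = parse_date("03/04/2021", "%m/%d/%Y")   # 4 March 2021
day_first = parse_date("03/04/2021", "%d/%m/%Y")     # 3 April 2021

# A day-first format rejects 01/15/2020 because there is no month 15.
try:
    parse_date("01/15/2020", "%d/%m/%Y")
    day_first_accepts = True
except ValueError:
    day_first_accepts = False

print(us_date.date(), month_first.date(), day_first.date(), day_first_accepts)
```

Because values such as 03/04/2021 parse successfully under both formats, it is safer to review an auto-detected format than to assume it is right.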
Timestamps¶
Unix timestamps:
name,event_time
Alice,1678901234
Bob,1678902345
Field file:
[event_time]
type = "timestamp"
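A Unix timestamp counts seconds since 1970-01-01 UTC. This short Python sketch converts the sample values so you can check what to expect after import; `from_unix` is a hypothetical helper, and treating the values as UTC seconds is an assumption about the data.

```python
from datetime import datetime, timezone


def from_unix(seconds):
    """Convert a Unix timestamp (seconds, as str or int) to an aware UTC datetime."""
    return datetime.fromtimestamp(int(seconds), tz=timezone.utc)


alice_time = from_unix("1678901234")
bob_time = from_unix("1678902345")
print(alice_time.isoformat())  # 2023-03-15T17:27:14+00:00
print(bob_time.isoformat())    # 2023-03-15T17:45:45+00:00
```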
Handling Errors¶
Skip Bad Rows¶
By default, PyImport warns about errors and continues. The command below sets that behavior explicitly:
pyimport --onerror Warn data.csv
Stop on First Error¶
For strict validation:
pyimport --onerror Fail data.csv
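The two policies can be sketched in Python as follows. This is a simplified model, not PyImport's implementation: `convert_rows` is hypothetical, and dropping the offending row in warn mode is an assumption based on the "Skip Bad Rows" heading.

```python
import logging


def convert_rows(rows, onerror="warn"):
    """Convert [name, age] rows to dicts, warning or failing on bad values."""
    converted = []
    for row_no, row in enumerate(rows, start=1):
        try:
            converted.append({"name": row[0], "age": int(row[1])})
        except (ValueError, IndexError) as exc:
            if onerror == "fail":
                # Strict mode: stop at the first bad row.
                raise ValueError(f"row {row_no}: {exc}") from exc
            # Lenient mode: report the problem and move on.
            logging.warning("skipping row %d: %s", row_no, exc)
    return converted


sample_rows = [["Alice", "30"], ["Bob", "n/a"], ["Carol", "41"]]
kept = convert_rows(sample_rows, onerror="warn")

try:
    convert_rows(sample_rows, onerror="fail")
    strict_failed = False
except ValueError:
    strict_failed = True

print(kept, strict_failed)
```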
Debug Import Issues¶
Use verbose logging to see what’s happening:
pyimport --loglevel DEBUG --verbose data.csv
Performance Comparison¶
Import of 200,000 rows (NYC taxi data):
| Method | Time | Docs/sec |
|---|---|---|
| Sync (default) | ~8.3s | ~24,000 |
| Async | ~6.6s | ~30,000 |
| Multi-process (4 cores) | ~4.0s | ~50,000 |
Commands:
# Default sync
pyimport data.csv
# Async
pyimport --asyncpro data.csv
# Multi-process
pyimport --multi --splitfile --autosplit 8 --poolsize 4 data.csv
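The Docs/sec figures in the table follow directly from the row count and the elapsed time. A short Python check of that arithmetic, using only the numbers reported above:

```python
ROWS = 200_000
timings = {"sync": 8.3, "async": 6.6, "multi": 4.0}  # seconds, from the table

# Throughput is rows divided by elapsed seconds.
throughput = {name: ROWS / seconds for name, seconds in timings.items()}

# Speedup is measured relative to the default sync mode.
speedup = {name: timings["sync"] / seconds for name, seconds in timings.items()}

for name in timings:
    print(f"{name}: {throughput[name]:,.0f} docs/sec, {speedup[name]:.2f}x vs sync")
```

On these figures the multi-process run is roughly twice as fast as the default, not four times, so adding workers does not scale throughput linearly.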
Configuration File¶
Create ~/.pyimport.conf to avoid repeating options:
# MongoDB connection
mdburi = mongodb://localhost:27017
database = mydb
# Import settings
batchsize = 5000
hasheader = True
addfilename = True
addtimestamp = doc
# Parallel processing
poolsize = 4
Then simply:
pyimport --collection users users.csv
Next Steps¶
Now that you know the basics:
Command-Line Reference - Complete list of all options
Field Files - Deep dive into type conversion
Advanced Usage - Optimization, troubleshooting, and advanced features
Common Issues¶
“Connection refused”¶
MongoDB is most likely not running, or is not listening on the host and port in your connection URI. To start it:
# macOS
brew services start mongodb-community
# Linux (systemd; the service is named mongod with the official MongoDB packages)
sudo systemctl start mongod
# Docker
docker run -d -p 27017:27017 mongo
“Field count mismatch”¶
Your CSV has inconsistent column counts. Check for:
Missing commas
Extra commas in data
Wrong delimiter setting
Use --loglevel DEBUG to see which row is causing issues.
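If you want to locate the offending rows before importing, a few lines of Python with the standard `csv` module will do it. This is a standalone sketch, independent of PyImport; `find_mismatched_rows` is a hypothetical helper that treats the first row as the reference column count.

```python
import csv
import io


def find_mismatched_rows(text, delimiter=","):
    """Return (row_number, field_count) for rows whose width differs from row 1."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return []
    expected = len(rows[0])
    return [
        (row_no, len(row))
        for row_no, row in enumerate(rows, start=1)
        if len(row) != expected
    ]


BROKEN = "a,b,c\n1,2,3\n4,5\n6,7,8,9\n"
mismatches = find_mismatched_rows(BROKEN)
print(mismatches)  # [(3, 2), (4, 4)]
```

Row numbers are counted in parsed rows, so they can differ from physical line numbers when quoted fields contain line breaks.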
“No field file found”¶
PyImport couldn’t find a .tff file. Either:
Generate one:
pyimport --genfieldfile data.csv
Specify one explicitly:
pyimport --fieldfile myfields.tff data.csv
Dates Not Parsing¶
If dates aren’t converting properly:
Check the format in your field file
Use ISO format (YYYY-MM-DD) when possible
Specify format explicitly:
[date]
type = "date"
format = "%d/%m/%Y"  # DD/MM/YYYY
Import Is Slow¶
Try these optimizations:
Use --multi --splitfile
Increase --batchsize to 5000-10000
Use --writeconcern 0 for fastest writes
Ensure MongoDB has proper indexes (add after import)
Use SSD storage for MongoDB