Field Files (.tff)¶
Field files define how CSV columns should be parsed and converted to MongoDB types. They use TOML format (.tff extension).
What is a Field File?¶
A field file maps each CSV column to a data type and optionally a format specification. When PyImport reads your CSV, it uses the field file to:
Convert strings to appropriate types (int, float, date, etc.)
Handle date/time formats
Validate data
Fall back to strings on conversion errors (unless
--onerror Failis used)
Generating Field Files¶
The easiest way to create a field file is to let PyImport infer types:
pyimport --genfieldfile data.csv
This analyzes your CSV and creates data.tff with detected types.
Field File Format¶
Field files use TOML syntax. Each field has its own section:
[column_name]
type = "type_name"
format = "format_string" # Optional, for dates
Basic Example¶
Given this CSV:
name,age,salary,active
Alice,30,75000.50,true
Bob,25,65000.00,false
Field file (data.tff):
[name]
type = "str"
[age]
type = "int"
[salary]
type = "float"
[active]
type = "bool" # Now supported in v2.0.8+
Supported Types¶
String (str)¶
Text data. No conversion applied.
[name]
type = "str"
name
Alice
Bob
Result: {"name": "Alice"}
Integer (int)¶
Whole numbers. Handles conversion from float strings.
[age]
type = "int"
age
30
42.0 # Converts to 42
Result: {"age": 30}
Float (float)¶
Decimal numbers.
[salary]
type = "float"
salary
75000.50
80000
Result: {"salary": 75000.50}
Boolean (bool)¶
New in v2.0.8! - Boolean/logical values.
[is_active]
type = "bool"
is_active
true
false
yes
no
1
0
Result: {"is_active": true} or {"is_active": false}
Accepted values (case-insensitive):
true,t,yes,y,1→truefalse,f,no,n,0→false
Note: Prior to v2.0.8, boolean values had to be stored as strings.
Date (date)¶
Date values without time component.
Without format (slow):
[birth_date]
type = "date"
Uses dateutil.parser - flexible but slow (~10-100x slower).
With format (fast):
[birth_date]
type = "date"
format = "%Y-%m-%d"
Uses strptime - fast and predictable.
birth_date
1990-05-15
1985-12-20
Result: {"birth_date": ISODate("1990-05-15T00:00:00Z")}
DateTime (datetime)¶
Date with time component.
Without format:
[created_at]
type = "datetime"
With format:
[created_at]
type = "datetime"
format = "%Y-%m-%d %H:%M:%S"
created_at
2024-01-15 14:30:00
2024-01-16 09:15:30
Result: {"created_at": ISODate("2024-01-15T14:30:00Z")}
ISO Date (isodate)¶
Fastest date parsing - specifically for ISO 8601 format (YYYY-MM-DD).
[date]
type = "isodate"
date
2024-01-15
2024-01-16
Performance: ~100x faster than generic date parsing. Use this whenever possible!
Timestamp (timestamp)¶
Unix timestamp (seconds since epoch).
[event_time]
type = "timestamp"
event_time
1705334400
1705420800
Result: {"event_time": ISODate("2024-01-15T12:00:00Z")}
Date Format Strings¶
Format strings use Python’s strptime syntax:
Code |
Meaning |
Example |
|---|---|---|
|
4-digit year |
2024 |
|
2-digit year |
24 |
|
Month (01-12) |
01 |
|
Day (01-31) |
15 |
|
Hour 24h (00-23) |
14 |
|
Hour 12h (01-12) |
02 |
|
Minute (00-59) |
30 |
|
Second (00-59) |
45 |
|
AM/PM |
PM |
|
Abbreviated month |
Jan |
|
Full month |
January |
|
Abbreviated day |
Mon |
|
Full day |
Monday |
Common Date Format Examples¶
# US date: 01/15/2024
[date]
type = "date"
format = "%m/%d/%Y"
# European date: 15/01/2024
[date]
type = "date"
format = "%d/%m/%Y"
# ISO date: 2024-01-15
[date]
type = "isodate" # No format needed!
# Date with time: 2024-01-15 14:30:00
[datetime]
type = "datetime"
format = "%Y-%m-%d %H:%M:%S"
# 12-hour time: 01/15/2024 02:30 PM
[datetime]
type = "datetime"
format = "%m/%d/%Y %I:%M %p"
# Long format: January 15, 2024
[date]
type = "date"
format = "%B %d, %Y"
Defaults Section¶
The DEFAULTS_SECTION defines CSV parsing options:
[DEFAULTS_SECTION]
delimiter = ","
has_header = true
Options¶
delimiter: Field separator (default:,)Use
\tfor tabAny string:
|,;, etc.
has_header: Whether first row is header (default:true)
Example with pipe delimiter:
[DEFAULTS_SECTION]
delimiter = "|"
has_header = true
[name]
type = "str"
[age]
type = "int"
Advanced Field File Examples¶
Mixed Date Formats¶
You can have different date formats for different columns:
[birth_date]
type = "date"
format = "%m/%d/%Y" # US format
[hire_date]
type = "isodate" # ISO format (fastest)
[last_login]
type = "datetime"
format = "%Y-%m-%d %H:%M:%S"
Optional Format Inference¶
Let PyImport infer date formats:
pyimport --genfieldfile data.csv
Result:
[date_column]
type = "date"
format = "%Y-%m-%d" # Automatically detected!
NULL and Empty String Handling¶
PyImport handles NULL and empty strings gracefully:
name,age,city
Alice,30,NYC
Bob,, # Empty age, empty city
Charlie,35,NULL
With this field file:
[name]
type = "str"
[age]
type = "int"
[city]
type = "str"
Result:
{"name": "Alice", "age": 30, "city": "NYC"}
{"name": "Bob", "age": "", "city": ""} // Falls back to string
{"name": "Charlie", "age": 35, "city": "NULL"} // Stored as string "NULL"
Type Conversion Behavior¶
Successful Conversion¶
age
30
42
25
[age]
type = "int"
Result: All values stored as integers.
Failed Conversion (Default: Warn)¶
age
30
invalid
25
With --onerror Warn (default):
Logs warning about “invalid”
Stores “invalid” as string
Continues processing
Result:
{"age": 30}
{"age": "invalid"} // Fallback to string
{"age": 25}
Failed Conversion (Strict Mode)¶
With --onerror Fail:
pyimport --onerror Fail data.csv
Stops on first conversion error.
Field File Discovery¶
PyImport looks for field files in this order:
Explicit:
--fieldfile custom.tffAuto-discovery: Same name as CSV with
.tffextensiondata.csv→ looks fordata.tffusers.csv→ looks forusers.tff
Auto-generation: If no field file found and has header, attempts to infer types
Creating Field Files Manually¶
Step 1: Look at Your CSV¶
name,age,salary,join_date,is_active
Alice,30,75000,2020-01-15,true
Bob,25,65000,2021-03-22,false
Step 2: Determine Types¶
name: Stringage: Integersalary: Integer (could be float if decimals)join_date: ISO dateis_active: String (MongoDB has no native bool from CSV)
Step 3: Create .tff File¶
[DEFAULTS_SECTION]
delimiter = ","
has_header = true
[name]
type = "str"
[age]
type = "int"
[salary]
type = "int"
[join_date]
type = "isodate"
[is_active]
type = "str"
Step 4: Test¶
# Import first 10 rows to verify
pyimport --limit 10 --loglevel DEBUG \
--database test --collection sample \
--fieldfile myfields.tff \
data.csv
# Check results
mongosh test --eval "db.sample.find().pretty()"
Troubleshooting¶
“Field count mismatch”¶
Your CSV has inconsistent column counts.
Solution: Check your data for:
Missing delimiters
Wrong delimiter (using
,when data uses|)Extra delimiters in quoted fields
Dates Not Converting¶
Problem: Dates stored as strings instead of Date objects.
Solutions:
Check format string matches your data exactly
Use
isodatetype for ISO format (YYYY-MM-DD)Let PyImport infer format:
--genfieldfileVerify sample data with
--limit 10 --loglevel DEBUG
Numbers Stored as Strings¶
Problem: Numeric values stored as strings.
Solution: Check field file has correct type:
[price]
type = "float" # Not "str"
Field File Not Found¶
Problem: pyimport can’t find .tff file.
Solutions:
# Generate field file
pyimport --genfieldfile data.csv
# Or specify explicitly
pyimport --fieldfile path/to/fields.tff data.csv
Performance Tips¶
Use
isodatefor ISO dates - 100x faster than generic parsingAlways specify date formats -
formatparameter makes parsing much fasterUse appropriate types - Don’t use
strfor everythingTest with
--limit- Verify field file with small sample first
Field File Examples¶
Financial Data¶
[DEFAULTS_SECTION]
delimiter = ","
has_header = true
[transaction_id]
type = "str"
[date]
type = "isodate"
[amount]
type = "float"
[quantity]
type = "int"
[symbol]
type = "str"
[timestamp]
type = "datetime"
format = "%Y-%m-%d %H:%M:%S"
Log Data¶
[DEFAULTS_SECTION]
delimiter = "|"
has_header = false
[timestamp]
type = "timestamp"
[level]
type = "str"
[message]
type = "str"
[user_id]
type = "int"
NYC Taxi Data¶
[DEFAULTS_SECTION]
delimiter = ","
has_header = true
[VendorID]
type = "int"
[tpep_pickup_datetime]
type = "datetime"
format = "%Y-%m-%d %H:%M:%S"
[tpep_dropoff_datetime]
type = "datetime"
format = "%Y-%m-%d %H:%M:%S"
[passenger_count]
type = "int"
[trip_distance]
type = "float"
[fare_amount]
type = "float"
[tip_amount]
type = "float"
[total_amount]
type = "float"
TFF v2.0: Nested Document Mapping (NEW!)¶
Available in pyimport 2.0+
TFF v2.0 format allows you to map flat CSV data to nested JSON/MongoDB documents. This is perfect for creating properly structured documents instead of flat key-value pairs.
Why Use Nested Mapping?¶
Before (v1.0 - Flat Structure):
{
"first_name": "John",
"last_name": "Doe",
"street": "123 Main St",
"city": "Boston",
"state": "MA",
"zip": "02101"
}
After (v2.0 - Nested Structure):
{
"name": {
"first": "John",
"last": "Doe"
},
"address": {
"street": "123 Main St",
"city": "Boston",
"state": "MA",
"postal_code": "02101"
}
}
Basic Nested Mapping¶
Add a path field to specify where the value should be placed in the nested structure:
[first_name]
type = "str"
name = "first_name"
path = "name.first" # NEW: Nested path using dot notation
format = ""
[last_name]
type = "str"
name = "last_name"
path = "name.last"
format = ""
[city]
type = "str"
name = "city"
path = "address.city"
format = ""
[state]
type = "str"
name = "state"
path = "address.state"
format = ""
[DEFAULTS_SECTION]
delimiter = ","
has_header = true
Path Notation¶
Use dot notation to specify nested levels:
"name"→ Top-level field"user.name"→ One level:{user: {name: value}}"user.profile.name"→ Two levels:{user: {profile: {name: value}}}"data.metrics.score.final"→ Three levels
No limit on nesting depth!
Mixed v1.0 and v2.0 Fields¶
You can mix fields with and without paths in the same file:
[id]
type = "int"
name = "id"
# No path - stays at top level
[first_name]
type = "str"
name = "first_name"
path = "profile.name.first" # Nested
[email]
type = "str"
name = "email"
# No path - stays at top level
[city]
type = "str"
name = "city"
path = "profile.location.city" # Nested
Result:
{
"id": 123, // Top-level (no path)
"email": "user@example.com", // Top-level (no path)
"profile": { // Nested (has path)
"name": {
"first": "John"
},
"location": {
"city": "Boston"
}
}
}
Real-World Example: Healthcare Data¶
Transform flat A&E (Accident & Emergency) data into structured documents:
[SHA]
type = "str"
name = "SHA"
path = "organization.sha_code"
[Code]
type = "str"
name = "Code"
path = "organization.code"
[Name]
type = "str"
name = "Name"
path = "organization.name"
[Type1_Attendances]
type = "int"
name = "Type1_Attendances"
path = "departments.type1.attendances"
[Type1_Over4Hours]
type = "int"
name = "Type1_Over4Hours"
path = "departments.type1.over_4_hours"
[Percentage_4Hours]
type = "float"
name = "Percentage_4Hours"
path = "performance.within_4_hours_pct"
[Admissions_Type1]
type = "int"
name = "Admissions_Type1"
path = "admissions.type1"
[DEFAULTS_SECTION]
delimiter = ","
has_header = true
Result:
{
"organization": {
"sha_code": "Q44",
"code": "REM",
"name": "BURTON HOSPITALS NHS TRUST"
},
"departments": {
"type1": {
"attendances": 7523,
"over_4_hours": 1234
}
},
"performance": {
"within_4_hours_pct": 83.6
},
"admissions": {
"type1": 1456
}
}
Real-World Example: NYC Taxi with Geospatial¶
Create GeoJSON-compatible nested structures for location data:
[VendorID]
type = "int"
name = "VendorID"
path = "vendor.id"
[tpep_pickup_datetime]
type = "datetime"
name = "tpep_pickup_datetime"
path = "pickup.datetime"
format = "%Y-%m-%d %H:%M:%S"
[pickup_longitude]
type = "float"
name = "pickup_longitude"
path = "pickup.location.coordinates.longitude"
[pickup_latitude]
type = "float"
name = "pickup_latitude"
path = "pickup.location.coordinates.latitude"
[dropoff_longitude]
type = "float"
name = "dropoff_longitude"
path = "dropoff.location.coordinates.longitude"
[dropoff_latitude]
type = "float"
name = "dropoff_latitude"
path = "dropoff.location.coordinates.latitude"
[passenger_count]
type = "int"
name = "passenger_count"
path = "trip.passengers"
[trip_distance]
type = "float"
name = "trip_distance"
path = "trip.distance"
[fare_amount]
type = "float"
name = "fare_amount"
path = "payment.fare"
[tip_amount]
type = "float"
name = "tip_amount"
path = "payment.tip"
[total_amount]
type = "float"
name = "total_amount"
path = "payment.total"
[DEFAULTS_SECTION]
delimiter = ","
has_header = true
Result:
{
"vendor": {
"id": 1
},
"pickup": {
"datetime": ISODate("2024-01-15T10:30:00Z"),
"location": {
"coordinates": {
"longitude": -73.9851,
"latitude": 40.7589
}
}
},
"dropoff": {
"location": {
"coordinates": {
"longitude": -73.9731,
"latitude": 40.7644
}
}
},
"trip": {
"passengers": 2,
"distance": 1.5
},
"payment": {
"fare": 12.5,
"tip": 2.5,
"total": 15.0
}
}
Note: For proper GeoJSON, you’d need Phase 2 features (field composition) to create the type: "Point" structure. This example shows the nested coordinates structure.
Querying Nested Documents¶
MongoDB queries work naturally with nested fields:
// Find by nested field
db.taxi.find({"vendor.id": 1})
// Range query on nested field
db.taxi.find({"payment.total": {$gte: 20}})
// Query multiple nested levels
db.taxi.find({
"pickup.location.coordinates.latitude": {$gte: 40.7, $lte: 40.8}
})
// Complex query
db.taxi.find({
"trip.distance": {$gte: 5},
"payment.tip": {$gte: 5}
})
Create Indexes on Nested Fields¶
// Index on nested field
db.taxi.createIndex({"vendor.id": 1})
// Compound index with nested fields
db.taxi.createIndex({
"pickup.datetime": 1,
"vendor.id": 1
})
// Geospatial index (after restructuring to GeoJSON)
db.taxi.createIndex({"pickup.location": "2dsphere"})
Path Validation¶
PyImport validates paths to prevent conflicts:
❌ Invalid - Prefix Conflict:
[field1]
type = "str"
path = "address"
[field2]
type = "str"
path = "address.city" # ERROR: Conflicts with "address"
❌ Invalid - Duplicate Path:
[field1]
type = "str"
path = "data.value"
[field2]
type = "str"
path = "data.value" # ERROR: Duplicate path
✅ Valid:
[field1]
type = "str"
path = "address.city"
[field2]
type = "str"
path = "address.state" # OK: Same parent, different fields
Backward Compatibility¶
100% backward compatible! All existing v1.0 field files work unchanged:
No
pathfield? → Flat structure (v1.0 behavior)Has
pathfield? → Nested structure (v2.0 behavior)Mixed? → Combination of both
Migrating from v1.0 to v2.0:
Start with existing v1.0 file
Add
pathto fields you want to nestLeave others without
path(they stay top-level)Import works immediately - no other changes needed!
Performance¶
v1.0 files: No performance impact
v2.0 files: < 5% overhead for nested document construction
Large datasets: Minimal impact (tested with 100k+ rows)
When to Use v2.0¶
Use nested mapping when:
You want properly structured documents
Data has logical groupings (name parts, address parts, etc.)
You need to query nested fields efficiently
You want MongoDB schema validation on nested structures
Stick with v1.0 when:
Simple flat data is sufficient
No logical grouping in your data
Performance is absolutely critical (though v2.0 is only ~5% slower)
Common Patterns¶
Name Components:
path = "name.first"
path = "name.last"
path = "name.middle"
Address Components:
path = "address.street"
path = "address.city"
path = "address.state"
path = "address.postal_code"
path = "address.country"
Contact Information:
path = "contact.email"
path = "contact.phone.mobile"
path = "contact.phone.home"
Metrics/Stats:
path = "stats.count"
path = "stats.average"
path = "stats.maximum"
path = "stats.minimum"
Timestamps:
path = "metadata.created_at"
path = "metadata.updated_at"
path = "metadata.deleted_at"
See Also¶
Quick Start - Basic usage examples
Command-Line Reference - All CLI options
Advanced Usage - Optimization and troubleshooting
TFF v2.0 Migration Guide - Detailed migration guide