Field Files (.tff)¶

Field files define how CSV columns should be parsed and converted to MongoDB types. They use TOML format (.tff extension).

What is a Field File?¶

A field file maps each CSV column to a data type and optionally a format specification. When PyImport reads your CSV, it uses the field file to:

Convert strings to appropriate types (int, float, date, etc.)
Handle date/time formats
Validate data
Fall back to strings on conversion errors (unless --onerror Fail is used)

Generating Field Files¶

The easiest way to create a field file is to let PyImport infer types:

pyimport --genfieldfile data.csv

This analyzes your CSV and creates data.tff with detected types.

Field File Format¶

Field files use TOML syntax. Each field has its own section:

[column_name]
type = "type_name"
format = "format_string"  # Optional, for dates

Basic Example¶

Given this CSV:

name,age,salary,active
Alice,30,75000.50,true
Bob,25,65000.00,false

Field file (data.tff):

[name]
type = "str"

[age]
type = "int"

[salary]
type = "float"

[active]
type = "bool"  # Now supported in v2.0.8+

Supported Types¶

String (`str`)¶

Text data. No conversion applied.

[name]
type = "str"

name
Alice
Bob

Result: {"name": "Alice"}

Integer (`int`)¶

Whole numbers. Handles conversion from float strings.

[age]
type = "int"

age
30
42.0  # Converts to 42

Result: {"age": 30}

Float (`float`)¶

Decimal numbers.

[salary]
type = "float"

salary
75000.50
80000

Result: {"salary": 75000.50}

Boolean (`bool`)¶

New in v2.0.8! - Boolean/logical values.

[is_active]
type = "bool"

is_active
true
false
yes
no
1
0

Result: {"is_active": true} or {"is_active": false}

Accepted values (case-insensitive):

true, t, yes, y, 1 → true
false, f, no, n, 0 → false

Note: Prior to v2.0.8, boolean values had to be stored as strings.

Date (`date`)¶

Date values without time component.

Without format (slow):

[birth_date]
type = "date"

Uses dateutil.parser - flexible but slow (~10-100x slower).

With format (fast):

[birth_date]
type = "date"
format = "%Y-%m-%d"

Uses strptime - fast and predictable.

birth_date
1990-05-15
1985-12-20

Result: {"birth_date": ISODate("1990-05-15T00:00:00Z")}

DateTime (`datetime`)¶

Date with time component.

Without format:

[created_at]
type = "datetime"

With format:

[created_at]
type = "datetime"
format = "%Y-%m-%d %H:%M:%S"

created_at
2024-01-15 14:30:00
2024-01-16 09:15:30

Result: {"created_at": ISODate("2024-01-15T14:30:00Z")}

ISO Date (`isodate`)¶

Fastest date parsing - specifically for ISO 8601 format (YYYY-MM-DD).

[date]
type = "isodate"

date
2024-01-15
2024-01-16

Performance: ~100x faster than generic date parsing. Use this whenever possible!

Timestamp (`timestamp`)¶

Unix timestamp (seconds since epoch).

[event_time]
type = "timestamp"

event_time
1705334400
1705420800

Result: {"event_time": ISODate("2024-01-15T12:00:00Z")}

Date Format Strings¶

Format strings use Python’s strptime syntax:

Code	Meaning	Example
`%Y`	4-digit year	2024
`%y`	2-digit year	24
`%m`	Month (01-12)	01
`%d`	Day (01-31)	15
`%H`	Hour 24h (00-23)	14
`%I`	Hour 12h (01-12)	02
`%M`	Minute (00-59)	30
`%S`	Second (00-59)	45
`%p`	AM/PM	PM
`%b`	Abbreviated month	Jan
`%B`	Full month	January
`%a`	Abbreviated day	Mon
`%A`	Full day	Monday

Common Date Format Examples¶

# US date: 01/15/2024
[date]
type = "date"
format = "%m/%d/%Y"

# European date: 15/01/2024
[date]
type = "date"
format = "%d/%m/%Y"

# ISO date: 2024-01-15
[date]
type = "isodate"  # No format needed!

# Date with time: 2024-01-15 14:30:00
[datetime]
type = "datetime"
format = "%Y-%m-%d %H:%M:%S"

# 12-hour time: 01/15/2024 02:30 PM
[datetime]
type = "datetime"
format = "%m/%d/%Y %I:%M %p"

# Long format: January 15, 2024
[date]
type = "date"
format = "%B %d, %Y"

Defaults Section¶

The DEFAULTS_SECTION defines CSV parsing options:

[DEFAULTS_SECTION]
delimiter = ","
has_header = true

Options¶

delimiter: Field separator (default: ,)
- Use \t for tab
- Any string: |, ;, etc.
has_header: Whether first row is header (default: true)

Example with pipe delimiter:

[DEFAULTS_SECTION]
delimiter = "|"
has_header = true

[name]
type = "str"

[age]
type = "int"

Advanced Field File Examples¶

Mixed Date Formats¶

You can have different date formats for different columns:

[birth_date]
type = "date"
format = "%m/%d/%Y"  # US format

[hire_date]
type = "isodate"  # ISO format (fastest)

[last_login]
type = "datetime"
format = "%Y-%m-%d %H:%M:%S"

Optional Format Inference¶

Let PyImport infer date formats:

pyimport --genfieldfile data.csv

Result:

[date_column]
type = "date"
format = "%Y-%m-%d"  # Automatically detected!

NULL and Empty String Handling¶

PyImport handles NULL and empty strings gracefully:

name,age,city
Alice,30,NYC
Bob,,  # Empty age, empty city
Charlie,35,NULL

With this field file:

[name]
type = "str"

[age]
type = "int"

[city]
type = "str"

Result:

{"name": "Alice", "age": 30, "city": "NYC"}
{"name": "Bob", "age": "", "city": ""}  // Falls back to string
{"name": "Charlie", "age": 35, "city": "NULL"}  // Stored as string "NULL"

Type Conversion Behavior¶

Successful Conversion¶

age
30
42
25

[age]
type = "int"

Result: All values stored as integers.

Failed Conversion (Default: Warn)¶

age
30
invalid
25

With --onerror Warn (default):

Logs warning about “invalid”
Stores “invalid” as string
Continues processing

Result:

{"age": 30}
{"age": "invalid"}  // Fallback to string
{"age": 25}

Failed Conversion (Strict Mode)¶

With --onerror Fail:

pyimport --onerror Fail data.csv

Stops on first conversion error.

Field File Discovery¶

PyImport looks for field files in this order:

Explicit: --fieldfile custom.tff
Auto-discovery: Same name as CSV with .tff extension
- data.csv → looks for data.tff
- users.csv → looks for users.tff
Auto-generation: If no field file found and has header, attempts to infer types

Creating Field Files Manually¶

Step 1: Look at Your CSV¶

name,age,salary,join_date,is_active
Alice,30,75000,2020-01-15,true
Bob,25,65000,2021-03-22,false

Step 2: Determine Types¶

name: String
age: Integer
salary: Integer (could be float if decimals)
join_date: ISO date
is_active: String (MongoDB has no native bool from CSV)

Step 3: Create `.tff` File¶

[DEFAULTS_SECTION]
delimiter = ","
has_header = true

[name]
type = "str"

[age]
type = "int"

[salary]
type = "int"

[join_date]
type = "isodate"

[is_active]
type = "str"

Step 4: Test¶

# Import first 10 rows to verify
pyimport --limit 10 --loglevel DEBUG \
         --database test --collection sample \
         --fieldfile myfields.tff \
         data.csv

# Check results
mongosh test --eval "db.sample.find().pretty()"

Troubleshooting¶

“Field count mismatch”¶

Your CSV has inconsistent column counts.

Solution: Check your data for:

Missing delimiters
Wrong delimiter (using , when data uses |)
Extra delimiters in quoted fields

Dates Not Converting¶

Problem: Dates stored as strings instead of Date objects.

Solutions:

Check format string matches your data exactly
Use isodate type for ISO format (YYYY-MM-DD)
Let PyImport infer format: --genfieldfile
Verify sample data with --limit 10 --loglevel DEBUG

Numbers Stored as Strings¶

Problem: Numeric values stored as strings.

Solution: Check field file has correct type:

[price]
type = "float"  # Not "str"

Field File Not Found¶

Problem: pyimport can’t find .tff file.

Solutions:

# Generate field file
pyimport --genfieldfile data.csv

# Or specify explicitly
pyimport --fieldfile path/to/fields.tff data.csv

Performance Tips¶

Use isodate for ISO dates - 100x faster than generic parsing
Always specify date formats - format parameter makes parsing much faster
Use appropriate types - Don’t use str for everything
Test with --limit - Verify field file with small sample first

Field File Examples¶

Financial Data¶

[DEFAULTS_SECTION]
delimiter = ","
has_header = true

[transaction_id]
type = "str"

[date]
type = "isodate"

[amount]
type = "float"

[quantity]
type = "int"

[symbol]
type = "str"

[timestamp]
type = "datetime"
format = "%Y-%m-%d %H:%M:%S"

Log Data¶

[DEFAULTS_SECTION]
delimiter = "|"
has_header = false

[timestamp]
type = "timestamp"

[level]
type = "str"

[message]
type = "str"

[user_id]
type = "int"

NYC Taxi Data¶

[DEFAULTS_SECTION]
delimiter = ","
has_header = true

[VendorID]
type = "int"

[tpep_pickup_datetime]
type = "datetime"
format = "%Y-%m-%d %H:%M:%S"

[tpep_dropoff_datetime]
type = "datetime"
format = "%Y-%m-%d %H:%M:%S"

[passenger_count]
type = "int"

[trip_distance]
type = "float"

[fare_amount]
type = "float"

[tip_amount]
type = "float"

[total_amount]
type = "float"

TFF v2.0: Nested Document Mapping (NEW!)¶

Available in pyimport 2.0+

TFF v2.0 format allows you to map flat CSV data to nested JSON/MongoDB documents. This is perfect for creating properly structured documents instead of flat key-value pairs.

Why Use Nested Mapping?¶

Before (v1.0 - Flat Structure):

{
  "first_name": "John",
  "last_name": "Doe",
  "street": "123 Main St",
  "city": "Boston",
  "state": "MA",
  "zip": "02101"
}

After (v2.0 - Nested Structure):

{
  "name": {
    "first": "John",
    "last": "Doe"
  },
  "address": {
    "street": "123 Main St",
    "city": "Boston",
    "state": "MA",
    "postal_code": "02101"
  }
}

Basic Nested Mapping¶

Add a path field to specify where the value should be placed in the nested structure:

[first_name]
type = "str"
name = "first_name"
path = "name.first"  # NEW: Nested path using dot notation
format = ""

[last_name]
type = "str"
name = "last_name"
path = "name.last"
format = ""

[city]
type = "str"
name = "city"
path = "address.city"
format = ""

[state]
type = "str"
name = "state"
path = "address.state"
format = ""

[DEFAULTS_SECTION]
delimiter = ","
has_header = true

Path Notation¶

Use dot notation to specify nested levels:

"name" → Top-level field
"user.name" → One level: {user: {name: value}}
"user.profile.name" → Two levels: {user: {profile: {name: value}}}
"data.metrics.score.final" → Three levels

No limit on nesting depth!

Mixed v1.0 and v2.0 Fields¶

You can mix fields with and without paths in the same file:

[id]
type = "int"
name = "id"
# No path - stays at top level

[first_name]
type = "str"
name = "first_name"
path = "profile.name.first"  # Nested

[email]
type = "str"
name = "email"
# No path - stays at top level

[city]
type = "str"
name = "city"
path = "profile.location.city"  # Nested

Result:

{
  "id": 123,                    // Top-level (no path)
  "email": "user@example.com",  // Top-level (no path)
  "profile": {                  // Nested (has path)
    "name": {
      "first": "John"
    },
    "location": {
      "city": "Boston"
    }
  }
}

Real-World Example: Healthcare Data¶

Transform flat A&E (Accident & Emergency) data into structured documents:

[SHA]
type = "str"
name = "SHA"
path = "organization.sha_code"

[Code]
type = "str"
name = "Code"
path = "organization.code"

[Name]
type = "str"
name = "Name"
path = "organization.name"

[Type1_Attendances]
type = "int"
name = "Type1_Attendances"
path = "departments.type1.attendances"

[Type1_Over4Hours]
type = "int"
name = "Type1_Over4Hours"
path = "departments.type1.over_4_hours"

[Percentage_4Hours]
type = "float"
name = "Percentage_4Hours"
path = "performance.within_4_hours_pct"

[Admissions_Type1]
type = "int"
name = "Admissions_Type1"
path = "admissions.type1"

[DEFAULTS_SECTION]
delimiter = ","
has_header = true

Result:

{
  "organization": {
    "sha_code": "Q44",
    "code": "REM",
    "name": "BURTON HOSPITALS NHS TRUST"
  },
  "departments": {
    "type1": {
      "attendances": 7523,
      "over_4_hours": 1234
    }
  },
  "performance": {
    "within_4_hours_pct": 83.6
  },
  "admissions": {
    "type1": 1456
  }
}

Real-World Example: NYC Taxi with Geospatial¶

Create GeoJSON-compatible nested structures for location data:

[VendorID]
type = "int"
name = "VendorID"
path = "vendor.id"

[tpep_pickup_datetime]
type = "datetime"
name = "tpep_pickup_datetime"
path = "pickup.datetime"
format = "%Y-%m-%d %H:%M:%S"

[pickup_longitude]
type = "float"
name = "pickup_longitude"
path = "pickup.location.coordinates.longitude"

[pickup_latitude]
type = "float"
name = "pickup_latitude"
path = "pickup.location.coordinates.latitude"

[dropoff_longitude]
type = "float"
name = "dropoff_longitude"
path = "dropoff.location.coordinates.longitude"

[dropoff_latitude]
type = "float"
name = "dropoff_latitude"
path = "dropoff.location.coordinates.latitude"

[passenger_count]
type = "int"
name = "passenger_count"
path = "trip.passengers"

[trip_distance]
type = "float"
name = "trip_distance"
path = "trip.distance"

[fare_amount]
type = "float"
name = "fare_amount"
path = "payment.fare"

[tip_amount]
type = "float"
name = "tip_amount"
path = "payment.tip"

[total_amount]
type = "float"
name = "total_amount"
path = "payment.total"

[DEFAULTS_SECTION]
delimiter = ","
has_header = true

Result:

{
  "vendor": {
    "id": 1
  },
  "pickup": {
    "datetime": ISODate("2024-01-15T10:30:00Z"),
    "location": {
      "coordinates": {
        "longitude": -73.9851,
        "latitude": 40.7589
      }
    }
  },
  "dropoff": {
    "location": {
      "coordinates": {
        "longitude": -73.9731,
        "latitude": 40.7644
      }
    }
  },
  "trip": {
    "passengers": 2,
    "distance": 1.5
  },
  "payment": {
    "fare": 12.5,
    "tip": 2.5,
    "total": 15.0
  }
}

Note: For proper GeoJSON, you’d need Phase 2 features (field composition) to create the type: "Point" structure. This example shows the nested coordinates structure.

Querying Nested Documents¶

MongoDB queries work naturally with nested fields:

// Find by nested field
db.taxi.find({"vendor.id": 1})

// Range query on nested field
db.taxi.find({"payment.total": {$gte: 20}})

// Query multiple nested levels
db.taxi.find({
  "pickup.location.coordinates.latitude": {$gte: 40.7, $lte: 40.8}
})

// Complex query
db.taxi.find({
  "trip.distance": {$gte: 5},
  "payment.tip": {$gte: 5}
})

Create Indexes on Nested Fields¶

// Index on nested field
db.taxi.createIndex({"vendor.id": 1})

// Compound index with nested fields
db.taxi.createIndex({
  "pickup.datetime": 1,
  "vendor.id": 1
})

// Geospatial index (after restructuring to GeoJSON)
db.taxi.createIndex({"pickup.location": "2dsphere"})

Path Validation¶

PyImport validates paths to prevent conflicts:

❌ Invalid - Prefix Conflict:

[field1]
type = "str"
path = "address"

[field2]
type = "str"
path = "address.city"  # ERROR: Conflicts with "address"

❌ Invalid - Duplicate Path:

[field1]
type = "str"
path = "data.value"

[field2]
type = "str"
path = "data.value"  # ERROR: Duplicate path

✅ Valid:

[field1]
type = "str"
path = "address.city"

[field2]
type = "str"
path = "address.state"  # OK: Same parent, different fields

Backward Compatibility¶

100% backward compatible! All existing v1.0 field files work unchanged:

No path field? → Flat structure (v1.0 behavior)
Has path field? → Nested structure (v2.0 behavior)
Mixed? → Combination of both

Migrating from v1.0 to v2.0:

Start with existing v1.0 file
Add path to fields you want to nest
Leave others without path (they stay top-level)
Import works immediately - no other changes needed!

Performance¶

v1.0 files: No performance impact
v2.0 files: < 5% overhead for nested document construction
Large datasets: Minimal impact (tested with 100k+ rows)

When to Use v2.0¶

Use nested mapping when:

You want properly structured documents
Data has logical groupings (name parts, address parts, etc.)
You need to query nested fields efficiently
You want MongoDB schema validation on nested structures

Stick with v1.0 when:

Simple flat data is sufficient
No logical grouping in your data
Performance is absolutely critical (though v2.0 is only ~5% slower)

Common Patterns¶

Name Components:

path = "name.first"
path = "name.last"
path = "name.middle"

Address Components:

path = "address.street"
path = "address.city"
path = "address.state"
path = "address.postal_code"
path = "address.country"

Contact Information:

path = "contact.email"
path = "contact.phone.mobile"
path = "contact.phone.home"

Metrics/Stats:

path = "stats.count"
path = "stats.average"
path = "stats.maximum"
path = "stats.minimum"

Timestamps:

path = "metadata.created_at"
path = "metadata.updated_at"
path = "metadata.deleted_at"

Field Files (.tff)¶

What is a Field File?¶

Generating Field Files¶

Field File Format¶

Basic Example¶

Supported Types¶

String (str)¶

Integer (int)¶

Float (float)¶

Boolean (bool)¶

Date (date)¶

DateTime (datetime)¶

ISO Date (isodate)¶

Timestamp (timestamp)¶

Date Format Strings¶

Common Date Format Examples¶

Defaults Section¶

Options¶

Advanced Field File Examples¶

Mixed Date Formats¶

Optional Format Inference¶

NULL and Empty String Handling¶

Type Conversion Behavior¶

Successful Conversion¶

Failed Conversion (Default: Warn)¶

Failed Conversion (Strict Mode)¶

Field File Discovery¶

Creating Field Files Manually¶

Step 1: Look at Your CSV¶

Step 2: Determine Types¶

Step 3: Create .tff File¶

Step 4: Test¶

Troubleshooting¶

“Field count mismatch”¶

Dates Not Converting¶

Numbers Stored as Strings¶

Field File Not Found¶

Performance Tips¶

Field File Examples¶

Financial Data¶

Log Data¶

NYC Taxi Data¶

TFF v2.0: Nested Document Mapping (NEW!)¶

Why Use Nested Mapping?¶

Basic Nested Mapping¶

Path Notation¶

Mixed v1.0 and v2.0 Fields¶

Real-World Example: Healthcare Data¶

Real-World Example: NYC Taxi with Geospatial¶

Querying Nested Documents¶

Create Indexes on Nested Fields¶

Path Validation¶

Backward Compatibility¶

Performance¶

When to Use v2.0¶

Common Patterns¶

See Also¶

String (`str`)¶

Integer (`int`)¶

Float (`float`)¶

Boolean (`bool`)¶

Date (`date`)¶

DateTime (`datetime`)¶

ISO Date (`isodate`)¶

Timestamp (`timestamp`)¶

Step 3: Create `.tff` File¶