# Field Files (.tff) Field files define how CSV columns should be parsed and converted to MongoDB types. They use TOML format (`.tff` extension). ## What is a Field File? A field file maps each CSV column to a data type and optionally a format specification. When PyImport reads your CSV, it uses the field file to: 1. Convert strings to appropriate types (int, float, date, etc.) 2. Handle date/time formats 3. Validate data 4. Fall back to strings on conversion errors (unless `--onerror Fail` is used) ## Generating Field Files The easiest way to create a field file is to let PyImport infer types: ```bash pyimport --genfieldfile data.csv ``` This analyzes your CSV and creates `data.tff` with detected types. ## Field File Format Field files use TOML syntax. Each field has its own section: ```toml [column_name] type = "type_name" format = "format_string" # Optional, for dates ``` ### Basic Example Given this CSV: ```csv name,age,salary,active Alice,30,75000.50,true Bob,25,65000.00,false ``` Field file (`data.tff`): ```toml [name] type = "str" [age] type = "int" [salary] type = "float" [active] type = "bool" # Now supported in v2.0.8+ ``` ## Supported Types ### String (`str`) Text data. No conversion applied. ```toml [name] type = "str" ``` ```csv name Alice Bob ``` Result: `{"name": "Alice"}` ### Integer (`int`) Whole numbers. Handles conversion from float strings. ```toml [age] type = "int" ``` ```csv age 30 42.0 # Converts to 42 ``` Result: `{"age": 30}` ### Float (`float`) Decimal numbers. ```toml [salary] type = "float" ``` ```csv salary 75000.50 80000 ``` Result: `{"salary": 75000.50}` ### Boolean (`bool`) **New in v2.0.8!** - Boolean/logical values. ```toml [is_active] type = "bool" ``` ```csv is_active true false yes no 1 0 ``` Result: `{"is_active": true}` or `{"is_active": false}` **Accepted values (case-insensitive):** - `true`, `t`, `yes`, `y`, `1` → `true` - `false`, `f`, `no`, `n`, `0` → `false` **Note:** Prior to v2.0.8, boolean values had to be stored as strings. ### Date (`date`) Date values without time component. **Without format (slow):** ```toml [birth_date] type = "date" ``` Uses `dateutil.parser` - flexible but slow (~10-100x slower). **With format (fast):** ```toml [birth_date] type = "date" format = "%Y-%m-%d" ``` Uses `strptime` - fast and predictable. ```csv birth_date 1990-05-15 1985-12-20 ``` Result: `{"birth_date": ISODate("1990-05-15T00:00:00Z")}` ### DateTime (`datetime`) Date with time component. **Without format:** ```toml [created_at] type = "datetime" ``` **With format:** ```toml [created_at] type = "datetime" format = "%Y-%m-%d %H:%M:%S" ``` ```csv created_at 2024-01-15 14:30:00 2024-01-16 09:15:30 ``` Result: `{"created_at": ISODate("2024-01-15T14:30:00Z")}` ### ISO Date (`isodate`) **Fastest date parsing** - specifically for ISO 8601 format (YYYY-MM-DD). ```toml [date] type = "isodate" ``` ```csv date 2024-01-15 2024-01-16 ``` **Performance:** ~100x faster than generic date parsing. Use this whenever possible! ### Timestamp (`timestamp`) Unix timestamp (seconds since epoch). ```toml [event_time] type = "timestamp" ``` ```csv event_time 1705334400 1705420800 ``` Result: `{"event_time": ISODate("2024-01-15T12:00:00Z")}` ## Date Format Strings Format strings use Python's `strptime` syntax: | Code | Meaning | Example | |------|---------|---------| | `%Y` | 4-digit year | 2024 | | `%y` | 2-digit year | 24 | | `%m` | Month (01-12) | 01 | | `%d` | Day (01-31) | 15 | | `%H` | Hour 24h (00-23) | 14 | | `%I` | Hour 12h (01-12) | 02 | | `%M` | Minute (00-59) | 30 | | `%S` | Second (00-59) | 45 | | `%p` | AM/PM | PM | | `%b` | Abbreviated month | Jan | | `%B` | Full month | January | | `%a` | Abbreviated day | Mon | | `%A` | Full day | Monday | ### Common Date Format Examples ```toml # US date: 01/15/2024 [date] type = "date" format = "%m/%d/%Y" # European date: 15/01/2024 [date] type = "date" format = "%d/%m/%Y" # ISO date: 2024-01-15 [date] type = "isodate" # No format needed! # Date with time: 2024-01-15 14:30:00 [datetime] type = "datetime" format = "%Y-%m-%d %H:%M:%S" # 12-hour time: 01/15/2024 02:30 PM [datetime] type = "datetime" format = "%m/%d/%Y %I:%M %p" # Long format: January 15, 2024 [date] type = "date" format = "%B %d, %Y" ``` ## Defaults Section The `DEFAULTS_SECTION` defines CSV parsing options: ```toml [DEFAULTS_SECTION] delimiter = "," has_header = true ``` ### Options - `delimiter`: Field separator (default: `,`) - Use `\t` for tab - Any string: `|`, `;`, etc. - `has_header`: Whether first row is header (default: `true`) Example with pipe delimiter: ```toml [DEFAULTS_SECTION] delimiter = "|" has_header = true [name] type = "str" [age] type = "int" ``` ## Advanced Field File Examples ### Mixed Date Formats You can have different date formats for different columns: ```toml [birth_date] type = "date" format = "%m/%d/%Y" # US format [hire_date] type = "isodate" # ISO format (fastest) [last_login] type = "datetime" format = "%Y-%m-%d %H:%M:%S" ``` ### Optional Format Inference Let PyImport infer date formats: ```bash pyimport --genfieldfile data.csv ``` Result: ```toml [date_column] type = "date" format = "%Y-%m-%d" # Automatically detected! ``` ### NULL and Empty String Handling PyImport handles `NULL` and empty strings gracefully: ```csv name,age,city Alice,30,NYC Bob,, # Empty age, empty city Charlie,35,NULL ``` With this field file: ```toml [name] type = "str" [age] type = "int" [city] type = "str" ``` Result: ```json {"name": "Alice", "age": 30, "city": "NYC"} {"name": "Bob", "age": "", "city": ""} // Falls back to string {"name": "Charlie", "age": 35, "city": "NULL"} // Stored as string "NULL" ``` ## Type Conversion Behavior ### Successful Conversion ```csv age 30 42 25 ``` ```toml [age] type = "int" ``` Result: All values stored as integers. ### Failed Conversion (Default: Warn) ```csv age 30 invalid 25 ``` With `--onerror Warn` (default): - Logs warning about "invalid" - Stores "invalid" as string - Continues processing Result: ```json {"age": 30} {"age": "invalid"} // Fallback to string {"age": 25} ``` ### Failed Conversion (Strict Mode) With `--onerror Fail`: ```bash pyimport --onerror Fail data.csv ``` Stops on first conversion error. ## Field File Discovery PyImport looks for field files in this order: 1. **Explicit:** `--fieldfile custom.tff` 2. **Auto-discovery:** Same name as CSV with `.tff` extension - `data.csv` → looks for `data.tff` - `users.csv` → looks for `users.tff` 3. **Auto-generation:** If no field file found and has header, attempts to infer types ## Creating Field Files Manually ### Step 1: Look at Your CSV ```csv name,age,salary,join_date,is_active Alice,30,75000,2020-01-15,true Bob,25,65000,2021-03-22,false ``` ### Step 2: Determine Types - `name`: String - `age`: Integer - `salary`: Integer (could be float if decimals) - `join_date`: ISO date - `is_active`: String (MongoDB has no native bool from CSV) ### Step 3: Create `.tff` File ```toml [DEFAULTS_SECTION] delimiter = "," has_header = true [name] type = "str" [age] type = "int" [salary] type = "int" [join_date] type = "isodate" [is_active] type = "str" ``` ### Step 4: Test ```bash # Import first 10 rows to verify pyimport --limit 10 --loglevel DEBUG \ --database test --collection sample \ --fieldfile myfields.tff \ data.csv # Check results mongosh test --eval "db.sample.find().pretty()" ``` ## Troubleshooting ### "Field count mismatch" Your CSV has inconsistent column counts. **Solution:** Check your data for: - Missing delimiters - Wrong delimiter (using `,` when data uses `|`) - Extra delimiters in quoted fields ### Dates Not Converting **Problem:** Dates stored as strings instead of Date objects. **Solutions:** 1. Check format string matches your data exactly 2. Use `isodate` type for ISO format (YYYY-MM-DD) 3. Let PyImport infer format: `--genfieldfile` 4. Verify sample data with `--limit 10 --loglevel DEBUG` ### Numbers Stored as Strings **Problem:** Numeric values stored as strings. **Solution:** Check field file has correct type: ```toml [price] type = "float" # Not "str" ``` ### Field File Not Found **Problem:** `pyimport` can't find `.tff` file. **Solutions:** ```bash # Generate field file pyimport --genfieldfile data.csv # Or specify explicitly pyimport --fieldfile path/to/fields.tff data.csv ``` ## Performance Tips 1. **Use `isodate` for ISO dates** - 100x faster than generic parsing 2. **Always specify date formats** - `format` parameter makes parsing much faster 3. **Use appropriate types** - Don't use `str` for everything 4. **Test with `--limit`** - Verify field file with small sample first ## Field File Examples ### Financial Data ```toml [DEFAULTS_SECTION] delimiter = "," has_header = true [transaction_id] type = "str" [date] type = "isodate" [amount] type = "float" [quantity] type = "int" [symbol] type = "str" [timestamp] type = "datetime" format = "%Y-%m-%d %H:%M:%S" ``` ### Log Data ```toml [DEFAULTS_SECTION] delimiter = "|" has_header = false [timestamp] type = "timestamp" [level] type = "str" [message] type = "str" [user_id] type = "int" ``` ### NYC Taxi Data ```toml [DEFAULTS_SECTION] delimiter = "," has_header = true [VendorID] type = "int" [tpep_pickup_datetime] type = "datetime" format = "%Y-%m-%d %H:%M:%S" [tpep_dropoff_datetime] type = "datetime" format = "%Y-%m-%d %H:%M:%S" [passenger_count] type = "int" [trip_distance] type = "float" [fare_amount] type = "float" [tip_amount] type = "float" [total_amount] type = "float" ``` ## TFF v2.0: Nested Document Mapping (NEW!) **Available in pyimport 2.0+** TFF v2.0 format allows you to map flat CSV data to nested JSON/MongoDB documents. This is perfect for creating properly structured documents instead of flat key-value pairs. ### Why Use Nested Mapping? **Before (v1.0 - Flat Structure):** ```json { "first_name": "John", "last_name": "Doe", "street": "123 Main St", "city": "Boston", "state": "MA", "zip": "02101" } ``` **After (v2.0 - Nested Structure):** ```json { "name": { "first": "John", "last": "Doe" }, "address": { "street": "123 Main St", "city": "Boston", "state": "MA", "postal_code": "02101" } } ``` ### Basic Nested Mapping Add a `path` field to specify where the value should be placed in the nested structure: ```toml [first_name] type = "str" name = "first_name" path = "name.first" # NEW: Nested path using dot notation format = "" [last_name] type = "str" name = "last_name" path = "name.last" format = "" [city] type = "str" name = "city" path = "address.city" format = "" [state] type = "str" name = "state" path = "address.state" format = "" [DEFAULTS_SECTION] delimiter = "," has_header = true ``` ### Path Notation Use dot notation to specify nested levels: - `"name"` → Top-level field - `"user.name"` → One level: `{user: {name: value}}` - `"user.profile.name"` → Two levels: `{user: {profile: {name: value}}}` - `"data.metrics.score.final"` → Three levels **No limit** on nesting depth! ### Mixed v1.0 and v2.0 Fields You can mix fields with and without paths in the same file: ```toml [id] type = "int" name = "id" # No path - stays at top level [first_name] type = "str" name = "first_name" path = "profile.name.first" # Nested [email] type = "str" name = "email" # No path - stays at top level [city] type = "str" name = "city" path = "profile.location.city" # Nested ``` Result: ```json { "id": 123, // Top-level (no path) "email": "user@example.com", // Top-level (no path) "profile": { // Nested (has path) "name": { "first": "John" }, "location": { "city": "Boston" } } } ``` ### Real-World Example: Healthcare Data Transform flat A&E (Accident & Emergency) data into structured documents: ```toml [SHA] type = "str" name = "SHA" path = "organization.sha_code" [Code] type = "str" name = "Code" path = "organization.code" [Name] type = "str" name = "Name" path = "organization.name" [Type1_Attendances] type = "int" name = "Type1_Attendances" path = "departments.type1.attendances" [Type1_Over4Hours] type = "int" name = "Type1_Over4Hours" path = "departments.type1.over_4_hours" [Percentage_4Hours] type = "float" name = "Percentage_4Hours" path = "performance.within_4_hours_pct" [Admissions_Type1] type = "int" name = "Admissions_Type1" path = "admissions.type1" [DEFAULTS_SECTION] delimiter = "," has_header = true ``` Result: ```json { "organization": { "sha_code": "Q44", "code": "REM", "name": "BURTON HOSPITALS NHS TRUST" }, "departments": { "type1": { "attendances": 7523, "over_4_hours": 1234 } }, "performance": { "within_4_hours_pct": 83.6 }, "admissions": { "type1": 1456 } } ``` ### Real-World Example: NYC Taxi with Geospatial Create GeoJSON-compatible nested structures for location data: ```toml [VendorID] type = "int" name = "VendorID" path = "vendor.id" [tpep_pickup_datetime] type = "datetime" name = "tpep_pickup_datetime" path = "pickup.datetime" format = "%Y-%m-%d %H:%M:%S" [pickup_longitude] type = "float" name = "pickup_longitude" path = "pickup.location.coordinates.longitude" [pickup_latitude] type = "float" name = "pickup_latitude" path = "pickup.location.coordinates.latitude" [dropoff_longitude] type = "float" name = "dropoff_longitude" path = "dropoff.location.coordinates.longitude" [dropoff_latitude] type = "float" name = "dropoff_latitude" path = "dropoff.location.coordinates.latitude" [passenger_count] type = "int" name = "passenger_count" path = "trip.passengers" [trip_distance] type = "float" name = "trip_distance" path = "trip.distance" [fare_amount] type = "float" name = "fare_amount" path = "payment.fare" [tip_amount] type = "float" name = "tip_amount" path = "payment.tip" [total_amount] type = "float" name = "total_amount" path = "payment.total" [DEFAULTS_SECTION] delimiter = "," has_header = true ``` Result: ```json { "vendor": { "id": 1 }, "pickup": { "datetime": ISODate("2024-01-15T10:30:00Z"), "location": { "coordinates": { "longitude": -73.9851, "latitude": 40.7589 } } }, "dropoff": { "location": { "coordinates": { "longitude": -73.9731, "latitude": 40.7644 } } }, "trip": { "passengers": 2, "distance": 1.5 }, "payment": { "fare": 12.5, "tip": 2.5, "total": 15.0 } } ``` **Note:** For proper GeoJSON, you'd need Phase 2 features (field composition) to create the `type: "Point"` structure. This example shows the nested coordinates structure. ### Querying Nested Documents MongoDB queries work naturally with nested fields: ```javascript // Find by nested field db.taxi.find({"vendor.id": 1}) // Range query on nested field db.taxi.find({"payment.total": {$gte: 20}}) // Query multiple nested levels db.taxi.find({ "pickup.location.coordinates.latitude": {$gte: 40.7, $lte: 40.8} }) // Complex query db.taxi.find({ "trip.distance": {$gte: 5}, "payment.tip": {$gte: 5} }) ``` ### Create Indexes on Nested Fields ```javascript // Index on nested field db.taxi.createIndex({"vendor.id": 1}) // Compound index with nested fields db.taxi.createIndex({ "pickup.datetime": 1, "vendor.id": 1 }) // Geospatial index (after restructuring to GeoJSON) db.taxi.createIndex({"pickup.location": "2dsphere"}) ``` ### Path Validation PyImport validates paths to prevent conflicts: **❌ Invalid - Prefix Conflict:** ```toml [field1] type = "str" path = "address" [field2] type = "str" path = "address.city" # ERROR: Conflicts with "address" ``` **❌ Invalid - Duplicate Path:** ```toml [field1] type = "str" path = "data.value" [field2] type = "str" path = "data.value" # ERROR: Duplicate path ``` **✅ Valid:** ```toml [field1] type = "str" path = "address.city" [field2] type = "str" path = "address.state" # OK: Same parent, different fields ``` ### Backward Compatibility **100% backward compatible!** All existing v1.0 field files work unchanged: - No `path` field? → Flat structure (v1.0 behavior) - Has `path` field? → Nested structure (v2.0 behavior) - Mixed? → Combination of both **Migrating from v1.0 to v2.0:** 1. Start with existing v1.0 file 2. Add `path` to fields you want to nest 3. Leave others without `path` (they stay top-level) 4. Import works immediately - no other changes needed! ### Performance - **v1.0 files**: No performance impact - **v2.0 files**: < 5% overhead for nested document construction - **Large datasets**: Minimal impact (tested with 100k+ rows) ### When to Use v2.0 **Use nested mapping when:** - You want properly structured documents - Data has logical groupings (name parts, address parts, etc.) - You need to query nested fields efficiently - You want MongoDB schema validation on nested structures **Stick with v1.0 when:** - Simple flat data is sufficient - No logical grouping in your data - Performance is absolutely critical (though v2.0 is only ~5% slower) ### Common Patterns **Name Components:** ```toml path = "name.first" path = "name.last" path = "name.middle" ``` **Address Components:** ```toml path = "address.street" path = "address.city" path = "address.state" path = "address.postal_code" path = "address.country" ``` **Contact Information:** ```toml path = "contact.email" path = "contact.phone.mobile" path = "contact.phone.home" ``` **Metrics/Stats:** ```toml path = "stats.count" path = "stats.average" path = "stats.maximum" path = "stats.minimum" ``` **Timestamps:** ```toml path = "metadata.created_at" path = "metadata.updated_at" path = "metadata.deleted_at" ``` ## See Also - [Quick Start](quickstart.md) - Basic usage examples - [Command-Line Reference](cli_reference.md) - All CLI options - [Advanced Usage](advanced.md) - Optimization and troubleshooting - [TFF v2.0 Migration Guide](v2_migration_guide.md) - Detailed migration guide