Sift Reference
Data scientists spend a surprising amount of time wrangling text. Log files, CSV exports, JSON dumps, configuration files—before any modeling can begin, there's often a messy pile of text that needs cleaning, filtering, and reshaping. I built Sift to bring SQL to this problem, letting you query and transform text files with the same language you use for databases. This reference covers every flag and option, with examples I've drawn from real data analysis workflows.
Command Modes
Sift operates in several distinct modes, each designed for a specific workflow. The mode determines how input is processed and what tables are available in your SQL queries.
--for "QUERY"
Execute SQL on standard input. Each line becomes a row in the lines table with line_number and content columns.
This is the foundation of Sift. Pipe any text into it, and you can immediately query it with SQL. For data analysis, this is invaluable when you need to quickly explore, filter, or transform data exports.
Example: Filtering a CSV for outliers
cat sales.csv | sift --for "
SELECT content FROM lines
WHERE line_number > 1
AND CAST(
substr(content, instr(content, ',') + 1,
instr(substr(content, instr(content, ',') + 1), ',') - 1
) AS REAL) > 10000
"
For cleaner CSV handling, use the regex functions:
cat sales.csv | sift --for "
SELECT content FROM lines
WHERE regex_match('[0-9]+\.[0-9]{2}', content)
AND CAST(regex_extract('([0-9]+\.[0-9]{2})', content, 1) AS REAL) > 10000
"
Example: Counting log levels
cat application.log | sift --for "
SELECT
CASE
WHEN content LIKE '%ERROR%' THEN 'ERROR'
WHEN content LIKE '%WARN%' THEN 'WARN'
WHEN content LIKE '%INFO%' THEN 'INFO'
ELSE 'OTHER'
END as level,
COUNT(*) as count
FROM lines
GROUP BY level
ORDER BY count DESC
"
--dig --for "QUERY"
Search across multiple files. Reads file paths from stdin, indexes their contents, and exposes files, lines, and search_fts tables.
When your data spans multiple files (monthly exports, partitioned datasets, distributed logs), --dig lets you query across all of them at once. The FTS5 full-text search index enables sophisticated pattern matching.
Example: Finding correlated events across log files
find /var/log -name "*.log" -mtime -7 | sift --dig --for "
SELECT filepath, content
FROM search_fts
WHERE content MATCH 'timeout AND database'
"
Example: Aggregating metrics from multiple JSON exports
find ./exports -name "metrics_*.json" | sift --dig --for "
SELECT
filepath,
COUNT(*) as line_count,
SUM(CASE WHEN content LIKE '%error%' THEN 1 ELSE 0 END) as error_mentions
FROM lines
JOIN files USING (file_id)
GROUP BY filepath
ORDER BY error_mentions DESC
"
Example: Dataset inventory
find ./data -name "*.csv" | sift --dig --for "
SELECT filepath, line_count as rows FROM files ORDER BY line_count DESC
"
--refine FILE --for "QUERY"
Transform an entire file using SQL. The query must return a content column, and the output replaces the file contents.
Use this for whole-file transformations: reformatting, sorting, deduplication. The query processes every line and the results become the new file.
Example: Sorting a dataset by a column
sift --refine data.csv --for "
SELECT content FROM lines
ORDER BY
CASE WHEN line_number = 1 THEN 0 ELSE 1 END,
CAST(regex_extract('^([^,]+)', content, 1) AS INTEGER)
"
This keeps the header row first, then sorts remaining rows by the first column numerically.
Example: Removing duplicate lines
sift --refine duplicates.txt --for "
SELECT content FROM lines GROUP BY content ORDER BY MIN(line_number)
"
Example: Normalizing whitespace in a dataset
sift --refine messy.csv --for "
SELECT trim(regex_replace('\s+', content, ' ')) as content
FROM lines
ORDER BY line_number
"
--pick FILE --for "QUERY"
Surgical line editing. The query must return line_number and content columns. Only the specified lines are modified; others remain unchanged.
When you need to fix specific rows without touching the rest of the file, --pick is precise. This is essential for cleaning data where only certain records have issues.
Example: Fixing malformed dates in specific rows
sift --pick transactions.csv --for "
SELECT line_number,
regex_replace('(\d{2})/(\d{2})/(\d{4})', content, '\$3-\$1-\$2') as content
FROM lines
WHERE content LIKE '%/__/____,%'
"
This converts MM/DD/YYYY to YYYY-MM-DD only on lines that have the old format.
Example: Anonymizing PII in flagged records
sift --pick users.csv --for "
SELECT line_number,
regex_replace('[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}', content, '[REDACTED]') as content
FROM lines
WHERE content LIKE '%@example.com%'
"
Example: Correcting a known bad value
sift --pick sensor_data.csv --for "
SELECT line_number, replace(content, '-999.0', 'NULL') as content
FROM lines
WHERE content LIKE '%-999.0%'
"
--sweep --for "QUERY"
Batch editing across multiple files. Reads file paths from stdin. Query must return filepath, line_number, and content.
For applying the same transformation across an entire dataset directory, --sweep processes multiple files in one pass.
Example: Standardizing column names across all CSVs
find ./data -name "*.csv" | sift --sweep --for "
SELECT f.filepath, l.line_number,
replace(replace(l.content, 'user_id', 'customer_id'), 'User ID', 'customer_id') as content
FROM lines l
JOIN files f USING (file_id)
WHERE l.line_number = 1
"
Example: Removing a deprecated field from all JSON files
find ./configs -name "*.json" | sift --sweep --for "
SELECT f.filepath, l.line_number,
regex_replace('\"legacy_flag\":\s*(true|false),?\s*', l.content, '') as content
FROM lines l
JOIN files f USING (file_id)
WHERE l.content LIKE '%legacy_flag%'
"
--drop-after N FILE
Insert content after line N. Use --content "text" to specify what to insert, or pipe content via stdin.
Example: Inserting a note above the header
echo "# profit_margin computed downstream" | sift --drop-after 0 sales.csv
With N set to 0, the piped text becomes the new line 1, pushing the existing header down to line 2.
Example: Inserting a data quality note
sift --drop-after 1 report.csv --content "# Data extracted on $(date), source: production DB"
--drop-before N FILE
Insert content before line N.
Example: Adding a schema header to a headerless CSV
sift --drop-before 1 raw_data.csv --content "timestamp,sensor_id,temperature,humidity,pressure"
Example: Prepending a license header to data files
sift --drop-before 1 dataset.csv --content "# Licensed under CC-BY-4.0. Attribution: Example Corp."
--peek FILE
Read and display a file with line numbers. Combine with --layer N-M to view specific ranges.
Before transforming data, you often need to inspect it. --peek shows line numbers, making it easy to identify which rows need attention.
Example: Inspecting the structure of a large file
sift --peek huge_dataset.csv --layer 1-5
1 customer_id,order_date,amount,product_id,region
2 1001,2024-01-15,250.00,SKU-001,WEST
3 1002,2024-01-15,125.50,SKU-002,EAST
4 1003,2024-01-16,89.99,SKU-001,WEST
5 1004,2024-01-16,432.00,SKU-003,NORTH
Example: Checking the end of a log file
sift --peek application.log --layer 9995-10000
--quarry [ACTION]
Manage the persistent workspace index. Actions: init, status, refresh, rebuild.
For large codebases or datasets, Sift can maintain a persistent index that speeds up repeated queries. The index is stored in a .sift/ directory.
Example: Setting up an index for a data warehouse
cd /data/warehouse
sift --quarry init
Example: Checking index status
sift --quarry status
Workspace: /data/warehouse
Database: .sift/workspace.db (45.2 MB)
Files: 1,247 indexed
Lines: 3,891,024 total
Indexed: 2 hours ago
Stale: 12 files changed since last index
Example: Updating after new data arrives
sift --quarry refresh
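When refresh is not enough (say the directory was bulk-renamed or the index looks corrupted), rebuild presumably discards the existing index and re-indexes from scratch; that reading follows from the action name rather than from verified behavior.
Example: Full rebuild after reorganizing the warehouse
sift --quarry rebuild
sift --quarry status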
Output Options
--grain FORMAT
Set output format. Options: plain (default), tsv, csv, json, ndjson, grep.
Data pipelines often require specific formats. --grain lets Sift output directly to the format your downstream tools expect.
Example: Exporting query results as JSON for a dashboard
cat sales.csv | sift --grain json --for "
SELECT
substr(content, 1, instr(content, ',') - 1) as date,
COUNT(*) as transactions
FROM lines
WHERE line_number > 1
GROUP BY date
"
[
{"date": "2024-01-15", "transactions": 42},
{"date": "2024-01-16", "transactions": 38},
{"date": "2024-01-17", "transactions": 55}
]
Example: Creating TSV for spreadsheet import
cat data.csv | sift --grain tsv --for "
SELECT line_number, content FROM lines WHERE content LIKE '%ERROR%'
"
Example: NDJSON for streaming to Elasticsearch
cat events.log | sift --grain ndjson --for "
SELECT
regex_extract('^(\d{4}-\d{2}-\d{2})', content, 1) as date,
regex_extract('\[(.*?)\]', content, 1) as level,
content as raw
FROM lines
"
--count, -c
Output only the row count, not the data.
Quick sanity checks and data validation often just need counts, not full results.
Example: Counting records matching a condition
cat transactions.csv | sift -c --for "
SELECT * FROM lines WHERE content LIKE '%FAILED%'
"
247
Example: Validating expected row count
EXPECTED=10000
ACTUAL=$(cat data.csv | sift -c --for "SELECT * FROM lines WHERE line_number > 1")
if [ "$ACTUAL" -ne "$EXPECTED" ]; then
echo "Row count mismatch: expected $EXPECTED, got $ACTUAL"
fi
--head N
Limit output to the first N rows.
Example: Previewing query results
cat huge_file.csv | sift --head 10 --for "
SELECT content FROM lines WHERE content LIKE '%anomaly%'
"
Example: Top N analysis
cat sales.csv | sift --head 5 --for "
SELECT content FROM lines
WHERE line_number > 1
ORDER BY CAST(regex_extract(',([0-9.]+)', content, 1) AS REAL) DESC
"
--tail N
Limit output to the last N rows.
Example: Most recent log entries
cat application.log | sift --tail 20 --for "SELECT content FROM lines"
Example: Bottom performers in a ranking
cat performance.csv | sift --tail 10 --for "
SELECT content FROM lines
WHERE line_number > 1
ORDER BY CAST(regex_extract(',([0-9.]+)', content, 1) AS REAL) DESC
"
File Reading
--layer N-M
Limit file reading to lines N through M. Use with --peek or when processing large files.
When files are enormous, reading the entire thing wastes resources. --layer lets you focus on specific sections.
Example: Sampling the middle of a sorted dataset
sift --peek sorted_by_date.csv --layer 5000-5010
Example: Extracting a date range from time-ordered data
# First, find where January data starts
cat timeseries.csv | sift --for "
SELECT line_number FROM lines WHERE content LIKE '2024-01-%' LIMIT 1
"
# Then extract that range
sift --peek timeseries.csv --layer 1042-2156
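The two steps chain naturally in a shell script. A sketch, assuming that with -q the range queries print nothing but the bare line numbers:
Example: Computing the range instead of hardcoding it
START=$(cat timeseries.csv | sift -q --for "
SELECT MIN(line_number) FROM lines WHERE content LIKE '2024-01-%'
")
END=$(cat timeseries.csv | sift -q --for "
SELECT MAX(line_number) FROM lines WHERE content LIKE '2024-01-%'
")
sift --peek timeseries.csv --layer "$START-$END"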
Editing & Safety
--shake
Preview changes without modifying files (dry run).
Before committing changes, especially on production data, always preview first.
Example: Previewing a batch transformation
find ./data -name "*.csv" | sift --sweep --shake --for "
SELECT f.filepath, l.line_number,
replace(l.content, 'N/A', '') as content
FROM lines l JOIN files f USING (file_id)
WHERE l.content LIKE '%N/A%'
"
[dry-run] Would modify 3 line(s) in ./data/q1.csv
[dry-run] Would modify 7 line(s) in ./data/q2.csv
[batch-edit] Would modify 2 file(s)
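If the summary is not detailed enough, pairing --shake with --diff (described next) should show the exact line changes without writing anything; the docs do not state that the two flags compose, so treat the combination as an assumption:
Example: Dry run with a full diff
sift --refine data.csv --shake --diff --for "
SELECT trim(content) as content FROM lines ORDER BY line_number
"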
--diff
Show a unified diff of changes.
Use --diff when you need to see exactly what will change, line by line.
Example: Reviewing a data cleaning operation
sift --pick data.csv --diff --for "
SELECT line_number, trim(content) as content
FROM lines WHERE content != trim(content)
"
Update(data.csv)
Added 2 lines, removed 2 lines
1 - customer_id,amount,date
1 + customer_id,amount,date
15- 1042, 250.00, 2024-01-15
15+ 1042, 250.00, 2024-01-15
--no-backup
Skip creating .bak backup files when editing.
By default, Sift creates backups before modifying files. Use this flag in automated pipelines where you have other backup mechanisms.
Example: Pipeline with version control
# Files are in git, no need for .bak files
sift --refine data.csv --no-backup --for "
SELECT upper(content) as content FROM lines ORDER BY line_number
"
Advanced Features
--weigh
Include query execution statistics.
When optimizing queries on large datasets, timing information helps identify bottlenecks.
Example: Profiling a complex aggregation
cat large_dataset.csv | sift --weigh --for "
SELECT
regex_extract('^([^,]+)', content, 1) as category,
COUNT(*) as count,
AVG(CAST(regex_extract(',([0-9.]+)', content, 1) AS REAL)) as avg_value
FROM lines
WHERE line_number > 1
GROUP BY category
"
category count avg_value
ELECTRONICS 15234 127.45
CLOTHING 23421 45.67
FOOD 8932 12.34
-- Stats: 47587 rows scanned, 3 rows returned, 0.045s execution time
--sieve [N]
Watch mode—re-run the query when files change. Optional interval in seconds.
For monitoring live data feeds or log files, --sieve provides continuous updates. Note: Not available in MCP mode.
Example: Live error monitoring
sift --sieve 5 --peek /var/log/app.log --tail 50 | grep ERROR
Example: Watching a data ingestion directory
find ./incoming -name "*.csv" | sift --sieve 10 --dig --for "
SELECT filepath, line_count FROM files ORDER BY filepath
"
--plugin PATH
Load additional SQL or C plugins for extended functionality.
Plugins add new SQL functions. Sift ships with several useful ones.
Example: Using the CSV parser plugin
cat data.csv | sift --plugin plugins/csv.so --for "
SELECT csv_field(content, 3) as third_column FROM lines
"
Example: Base64 encoding for data transfer
cat sensitive.csv | sift --plugin plugins/encoding.so --for "
SELECT base64_encode(content) as encoded FROM lines
"
--quiet, -q
Suppress non-essential output (banners, progress messages).
For scripting and automation, clean output matters.
Example: In a shell script
COUNT=$(cat data.csv | sift -q -c --for "SELECT * FROM lines WHERE line_number > 1")
echo "Processed $COUNT records"
--mcp
Run as an MCP (Model Context Protocol) server for AI agent integration.
This enables AI assistants like Claude to use Sift for code and data exploration.
Example: Registering with Claude Code
claude mcp add sift -- sift --mcp
--list-plugins
List available plugins and their functions.
sift --list-plugins
Loaded plugins:
fields.sql - Awk-style field extraction (f1, f2, ...)
dedup.sql - Deduplication views
statistics.sql - Line length statistics
encoding.so - Base64, Hex, URL encoding functions
csv.so - RFC 4180 CSV parsing
Quick Reference Table
For quick lookup, here's every flag organized by category:
| Category | Flag | Description |
|---|---|---|
| Command Modes | --for "QUERY" | SQL on stdin |
| | --dig --for "QUERY" | Multi-file search |
| | --refine FILE --for "QUERY" | Transform entire file |
| | --pick FILE --for "QUERY" | Edit specific lines |
| | --sweep --for "QUERY" | Batch edit multiple files |
| | --drop-after N FILE | Insert after line N |
| | --drop-before N FILE | Insert before line N |
| | --peek FILE | Read file with line numbers |
| | --quarry [ACTION] | Manage workspace index |
| | --mcp | MCP server mode |
| Output | --grain FORMAT | plain, tsv, csv, json, ndjson, grep |
| | --count, -c | Row count only |
| | --head N | First N rows |
| | --tail N | Last N rows |
| File Reading | --layer N-M | Line range (use with --peek) |
| Safety | --shake | Dry run (preview changes) |
| | --diff | Show unified diff |
| | --no-backup | Skip .bak files |
| Advanced | --weigh | Query statistics |
| | --sieve [N] | Watch mode (re-run on changes) |
| | --plugin PATH | Load extension |
| | --quiet, -q | Suppress banners |
| Info | --help, -h | Show help |
| | --version, -V | Show version |
| | --list-plugins | List loaded plugins |