Sift Reference

Complete documentation

Data scientists spend a surprising amount of time wrangling text. Log files, CSV exports, JSON dumps, configuration files—before any modeling can begin, there's often a messy pile of text that needs cleaning, filtering, and reshaping. I built Sift to bring SQL to this problem, letting you query and transform text files with the same language you use for databases. This reference covers every flag and option, with examples I've drawn from real data analysis workflows.

Contents

  Command Modes
  Output Options
  File Reading
  Editing & Safety
  Advanced Features
  Quick Reference Table

Command Modes

Sift operates in several distinct modes, each designed for a specific workflow. The mode determines how input is processed and what tables are available in your SQL queries.

--for "QUERY"

Execute SQL on standard input. Each line becomes a row in the lines table with line_number and content columns.

This is the foundation of Sift. Pipe any text into it, and you can immediately query it with SQL. For data analysis, this is invaluable when you need to quickly explore, filter, or transform data exports.

Example: Filtering a CSV for outliers

cat sales.csv | sift --for "
  SELECT content FROM lines
  WHERE line_number > 1
  AND CAST(
    substr(content, instr(content, ',') + 1,
           instr(substr(content, instr(content, ',') + 1), ',') - 1
    ) AS REAL) > 10000
"

For cleaner CSV handling, use the regex functions:

cat sales.csv | sift --for "
  SELECT content FROM lines
  WHERE regex_match('[0-9]+\.[0-9]{2}', content)
  AND CAST(regex_extract('([0-9]+\.[0-9]{2})', content, 1) AS REAL) > 10000
"

Example: Counting log levels

cat application.log | sift --for "
  SELECT
    CASE
      WHEN content LIKE '%ERROR%' THEN 'ERROR'
      WHEN content LIKE '%WARN%' THEN 'WARN'
      WHEN content LIKE '%INFO%' THEN 'INFO'
      ELSE 'OTHER'
    END as level,
    COUNT(*) as count
  FROM lines
  GROUP BY level
  ORDER BY count DESC
"

--dig --for "QUERY"

Search across multiple files. Reads file paths from stdin, indexes their contents, and exposes files, lines, and search_fts tables.

When your data spans multiple files (monthly exports, partitioned datasets, distributed logs), --dig lets you query across all of them at once. The FTS5 full-text search index enables sophisticated pattern matching.

Example: Finding correlated events across log files

find /var/log -name "*.log" -mtime -7 | sift --dig --for "
  SELECT filepath, content
  FROM search_fts
  WHERE content MATCH 'timeout AND database'
"

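If Sift passes the MATCH expression straight through to SQLite's FTS5 engine (an assumption; this reference only demonstrates boolean operators), proximity and prefix searches should also work. Treat the following as a sketch rather than a confirmed feature.

Example: Proximity and prefix search (hypothetical)

# assumes MATCH accepts standard FTS5 syntax, including NEAR() and the * prefix operator
find /var/log -name "*.log" -mtime -7 | sift --dig --for "
  SELECT filepath, content
  FROM search_fts
  WHERE content MATCH 'NEAR(deadlock retry, 10) OR conn*'
"
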
Example: Aggregating metrics from multiple JSON exports

find ./exports -name "metrics_*.json" | sift --dig --for "
  SELECT
    filepath,
    COUNT(*) as line_count,
    SUM(CASE WHEN content LIKE '%error%' THEN 1 ELSE 0 END) as error_mentions
  FROM lines
  JOIN files USING (file_id)
  GROUP BY filepath
  ORDER BY error_mentions DESC
"

Example: Dataset inventory

find ./data -name "*.csv" | sift --dig --for "
  SELECT filepath, line_count as rows FROM files ORDER BY line_count DESC
"

--refine FILE --for "QUERY"

Transform an entire file using SQL. The query must return a content column, and the output replaces the file contents.

Use this for whole-file transformations: reformatting, sorting, deduplication. The query processes every line and the results become the new file.

Example: Sorting a dataset by a column

sift --refine data.csv --for "
  SELECT content FROM lines
  ORDER BY
    CASE WHEN line_number = 1 THEN 0 ELSE 1 END,
    CAST(regex_extract('^([^,]+)', content, 1) AS INTEGER)
"

This keeps the header row first, then sorts remaining rows by the first column numerically.

Example: Removing duplicate lines

sift --refine duplicates.txt --for "
  SELECT content FROM lines GROUP BY content ORDER BY MIN(line_number)
"

Example: Normalizing whitespace in a dataset

sift --refine messy.csv --for "
  SELECT trim(regex_replace('\s+', content, ' ')) as content
  FROM lines
  ORDER BY line_number
"

--pick FILE --for "QUERY"

Surgical line editing. The query must return line_number and content columns. Only the specified lines are modified; others remain unchanged.

When you need to fix specific rows without touching the rest of the file, --pick is precise. This is essential for cleaning data where only certain records have issues.

Example: Fixing malformed dates in specific rows

sift --pick transactions.csv --for "
  SELECT line_number,
         regex_replace('(\d{2})/(\d{2})/(\d{4})', content, '\$3-\$1-\$2') as content
  FROM lines
  WHERE content LIKE '%/__/____,%'
"

This converts MM/DD/YYYY to YYYY-MM-DD only on lines that have the old format.

Example: Anonymizing PII in flagged records

sift --pick users.csv --for "
  SELECT line_number,
         regex_replace('[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}', content, '[REDACTED]') as content
  FROM lines
  WHERE content LIKE '%@example.com%'
"

Example: Correcting a known bad value

sift --pick sensor_data.csv --for "
  SELECT line_number, replace(content, '-999.0', 'NULL') as content
  FROM lines
  WHERE content LIKE '%-999.0%'
"

--sweep --for "QUERY"

Batch editing across multiple files. Reads file paths from stdin. Query must return filepath, line_number, and content.

For applying the same transformation across an entire dataset directory, --sweep processes multiple files in one pass.

Example: Standardizing column names across all CSVs

find ./data -name "*.csv" | sift --sweep --for "
  SELECT f.filepath, l.line_number,
         replace(replace(l.content, 'user_id', 'customer_id'), 'User ID', 'customer_id') as content
  FROM lines l
  JOIN files f USING (file_id)
  WHERE l.line_number = 1
"

Example: Removing a deprecated field from all JSON files

find ./configs -name "*.json" | sift --sweep --for "
  SELECT f.filepath, l.line_number,
         regex_replace('\"legacy_flag\":\s*(true|false),?\s*', l.content, '') as content
  FROM lines l
  JOIN files f USING (file_id)
  WHERE l.content LIKE '%legacy_flag%'
"

--drop-after N FILE

Insert content after line N. Use --content "text" to specify what to insert, or pipe content via stdin.

Example: Piping inserted content via stdin

echo "# source: nightly sales export" | sift --drop-after 0 sales.csv

Inserting after line 0 places the piped text above the existing header, making it the new first line of the file.

Example: Inserting a data quality note

sift --drop-after 1 report.csv --content "# Data extracted on $(date), source: production DB"

--drop-before N FILE

Insert content before line N.

Example: Adding a schema header to a headerless CSV

sift --drop-before 1 raw_data.csv --content "timestamp,sensor_id,temperature,humidity,pressure"

Example: Prepending a license header to data files

sift --drop-before 1 dataset.csv --content "# Licensed under CC-BY-4.0. Attribution: Example Corp."

--peek FILE

Read and display a file with line numbers. Combine with --layer N-M to view specific ranges.

Before transforming data, you often need to inspect it. --peek shows line numbers, making it easy to identify which rows need attention.

Example: Inspecting the structure of a large file

sift --peek huge_dataset.csv --layer 1-5
     1  customer_id,order_date,amount,product_id,region
     2  1001,2024-01-15,250.00,SKU-001,WEST
     3  1002,2024-01-15,125.50,SKU-002,EAST
     4  1003,2024-01-16,89.99,SKU-001,WEST
     5  1004,2024-01-16,432.00,SKU-003,NORTH

Example: Checking the end of a log file

sift --peek application.log --layer 9995-10000

--quarry [ACTION]

Manage the persistent workspace index. Actions: init, status, refresh, rebuild.

For large codebases or datasets, Sift can maintain a persistent index that speeds up repeated queries. The index is stored in a .sift/ directory.

Example: Setting up an index for a data warehouse

cd /data/warehouse
sift --quarry init

Example: Checking index status

sift --quarry status
Workspace: /data/warehouse
Database:  .sift/workspace.db (45.2 MB)
Files:     1,247 indexed
Lines:     3,891,024 total
Indexed:   2 hours ago
Stale:     12 files changed since last index

Example: Updating after new data arrives

sift --quarry refresh
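
The rebuild action has no example above; presumably it discards .sift/workspace.db and re-indexes every file from scratch, where refresh only picks up changed files. That behavior is an assumption based on the action names, not something this reference confirms.

Example: Rebuilding the index after an interrupted run (hypothetical behavior)

# assumed: full re-index, unlike the incremental refresh
sift --quarry rebuild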

Output Options

--grain FORMAT

Set output format. Options: plain (default), tsv, csv, json, ndjson, grep.

Data pipelines often require specific formats. --grain makes Sift emit results directly in the format your downstream tools expect.

Example: Exporting query results as JSON for a dashboard

cat sales.csv | sift --grain json --for "
  SELECT
    substr(content, 1, instr(content, ',') - 1) as date,
    COUNT(*) as transactions
  FROM lines
  WHERE line_number > 1
  GROUP BY date
"
[
  {"date": "2024-01-15", "transactions": 42},
  {"date": "2024-01-16", "transactions": 38},
  {"date": "2024-01-17", "transactions": 55}
]

Example: Creating TSV for spreadsheet import

cat data.csv | sift --grain tsv --for "
  SELECT line_number, content FROM lines WHERE content LIKE '%ERROR%'
"

Example: NDJSON for streaming to Elasticsearch

cat events.log | sift --grain ndjson --for "
  SELECT
    regex_extract('^(\d{4}-\d{2}-\d{2})', content, 1) as date,
    regex_extract('\[(.*?)\]', content, 1) as level,
    content as raw
  FROM lines
"

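The grep format has no example above. Assuming it renders filepath:line:content output (an assumption based on the format name) when the query supplies those three columns, it pairs naturally with editors and other line-oriented tools.

Example: grep-style output for editor integration (hypothetical)

# assumes --grain grep renders filepath:line_number:content (unverified)
find ./src -name "*.py" | sift --dig --grain grep --for "
  SELECT filepath, line_number, content
  FROM lines
  JOIN files USING (file_id)
  WHERE content LIKE '%TODO%'
"
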
--count, -c

Output only the row count, not the data.

Quick sanity checks and data validation often just need counts, not full results.

Example: Counting records matching a condition

cat transactions.csv | sift -c --for "
  SELECT * FROM lines WHERE content LIKE '%FAILED%'
"
247

Example: Validating expected row count

EXPECTED=10000
ACTUAL=$(cat data.csv | sift -c --for "SELECT * FROM lines WHERE line_number > 1")
if [ "$ACTUAL" -ne "$EXPECTED" ]; then
  echo "Row count mismatch: expected $EXPECTED, got $ACTUAL"
fi

--head N

Limit output to the first N rows.

Example: Previewing query results

cat huge_file.csv | sift --head 10 --for "
  SELECT content FROM lines WHERE content LIKE '%anomaly%'
"

Example: Top N analysis

cat sales.csv | sift --head 5 --for "
  SELECT content FROM lines
  WHERE line_number > 1
  ORDER BY CAST(regex_extract(',([0-9.]+)', content, 1) AS REAL) DESC
"

--tail N

Limit output to the last N rows.

Example: Most recent log entries

cat application.log | sift --tail 20 --for "SELECT content FROM lines"

Example: Bottom performers in a ranking

cat performance.csv | sift --tail 10 --for "
  SELECT content FROM lines
  WHERE line_number > 1
  ORDER BY CAST(regex_extract(',([0-9.]+)', content, 1) AS REAL) DESC
"

File Reading

--layer N-M

Limit file reading to lines N through M. Use with --peek or when processing large files.

When files are enormous, reading the entire thing wastes resources. --layer lets you focus on specific sections.

Example: Sampling the middle of a sorted dataset

sift --peek sorted_by_date.csv --layer 5000-5010

Example: Extracting a date range from time-ordered data

# First, find where January data starts
cat timeseries.csv | sift --for "
  SELECT line_number FROM lines WHERE content LIKE '2024-01-%' LIMIT 1
"
# Then extract that range (find the end line the same way)
sift --peek timeseries.csv --layer 1042-2156

Editing & Safety

--shake

Preview changes without modifying files (dry run).

Before committing changes, especially on production data, always preview first.

Example: Previewing a batch transformation

find ./data -name "*.csv" | sift --sweep --shake --for "
  SELECT f.filepath, l.line_number,
         replace(l.content, 'N/A', '') as content
  FROM lines l JOIN files f USING (file_id)
  WHERE l.content LIKE '%N/A%'
"
[dry-run] Would modify 3 line(s) in ./data/q1.csv
[dry-run] Would modify 7 line(s) in ./data/q2.csv

[batch-edit] Would modify 2 file(s)
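
The example above pairs --shake with --sweep, but a dry run should compose with the single-file modes as well. Assuming --shake applies equally to --refine and --pick (this reference does not show that combination, so verify on a scratch file first), previewing a whole-file transformation might look like:

Example: Previewing a whole-file transformation (hypothetical)

# assumes --shake also applies to --refine (verify on a copy first)
sift --refine data.csv --shake --for "
  SELECT trim(content) as content FROM lines ORDER BY line_number
"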

--diff

Show a unified diff of changes.

Use --diff when you need to see exactly what will change, line by line.

Example: Reviewing a data cleaning operation

sift --pick data.csv --diff --for "
  SELECT line_number, trim(content) as content
  FROM lines WHERE content != trim(content)
"
Update(data.csv)
Added 2 lines, removed 2 lines
   1 -    customer_id,amount,date
   1 +    customer_id,amount,date
   15-    1042,  250.00, 2024-01-15
   15+    1042,  250.00, 2024-01-15

--no-backup

Skip creating .bak backup files when editing.

By default, Sift creates backups before modifying files. Use this flag in automated pipelines where you have other backup mechanisms.

Example: Pipeline with version control

# Files are in git, no need for .bak files
sift --refine data.csv --no-backup --for "
  SELECT upper(content) as content FROM lines ORDER BY line_number
"

Advanced Features

--weigh

Include query execution statistics.

When optimizing queries on large datasets, timing information helps identify bottlenecks.

Example: Profiling a complex aggregation

cat large_dataset.csv | sift --weigh --for "
  SELECT
    regex_extract('^([^,]+)', content, 1) as category,
    COUNT(*) as count,
    AVG(CAST(regex_extract(',([0-9.]+)', content, 1) AS REAL)) as avg_value
  FROM lines
  WHERE line_number > 1
  GROUP BY category
"
category    count    avg_value
ELECTRONICS 15234    127.45
CLOTHING    23421    45.67
FOOD        8932     12.34

-- Stats: 47587 rows scanned, 3 rows returned, 0.045s execution time

--sieve [N]

Watch mode—re-run the query when files change. Optional interval in seconds.

For monitoring live data feeds or log files, --sieve provides continuous updates. Note: Not available in MCP mode.

Example: Live error monitoring

sift --sieve 5 --peek /var/log/app.log --layer -50 | grep ERROR

Example: Watching a data ingestion directory

find ./incoming -name "*.csv" | sift --sieve 10 --dig --for "
  SELECT filepath, line_count FROM files ORDER BY filepath
"

--plugin PATH

Load additional SQL or C plugins for extended functionality.

Plugins add new SQL functions. Sift ships with several useful ones.

Example: Using the CSV parser plugin

cat data.csv | sift --plugin plugins/csv.so --for "
  SELECT csv_field(content, 3) as third_column FROM lines
"

Example: Base64 encoding for data transfer

cat sensitive.csv | sift --plugin plugins/encoding.so --for "
  SELECT base64_encode(content) as encoded FROM lines
"

--quiet, -q

Suppress non-essential output (banners, progress messages).

For scripting and automation, clean output matters.

Example: In a shell script

COUNT=$(cat data.csv | sift -q -c --for "SELECT * FROM lines WHERE line_number > 1")
echo "Processed $COUNT records"

--mcp

Run as an MCP (Model Context Protocol) server for AI agent integration.

This enables AI assistants like Claude to use Sift for code and data exploration.

Example: Registering with Claude Code

claude mcp add sift -- sift --mcp

--list-plugins

List available plugins and their functions.

sift --list-plugins
Loaded plugins:
  fields.sql    - Awk-style field extraction (f1, f2, ...)
  dedup.sql     - Deduplication views
  statistics.sql - Line length statistics
  encoding.so   - Base64, Hex, URL encoding functions
  csv.so        - RFC 4180 CSV parsing
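
If fields.sql exposes awk-style fields as SQL functions over the line text, splitting on whitespace (an assumption; the listing only hints at the names f1, f2, ...), field-based queries become much shorter than the substr/instr gymnastics shown earlier:

Example: Top client IPs from an access log (hypothetical f1() helper)

# assumes f1() is a SQL function taking the line text and returning the first whitespace-separated field
cat access.log | sift --plugin plugins/fields.sql --for "
  SELECT f1(content) as client_ip, COUNT(*) as hits
  FROM lines
  GROUP BY client_ip
  ORDER BY hits DESC
"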

Quick Reference Table

For quick lookup, here's every flag organized by category:

Category         Flag                           Description
Command Modes    --for "QUERY"                  SQL on stdin
                 --dig --for "QUERY"            Multi-file search
                 --refine FILE --for "QUERY"    Transform entire file
                 --pick FILE --for "QUERY"      Edit specific lines
                 --sweep --for "QUERY"          Batch edit multiple files
                 --drop-after N FILE            Insert after line N
                 --drop-before N FILE           Insert before line N
                 --peek FILE                    Read file with line numbers
                 --quarry [ACTION]              Manage workspace index
                 --mcp                          MCP server mode
Output           --grain FORMAT                 plain, tsv, csv, json, ndjson, grep
                 --count, -c                    Row count only
                 --head N                       First N rows
                 --tail N                       Last N rows
File Reading     --layer N-M                    Line range (use with --peek)
Safety           --shake                        Dry run (preview changes)
                 --diff                         Show unified diff
                 --no-backup                    Skip .bak files
Advanced         --weigh                        Query statistics
                 --sieve [N]                    Watch mode (re-run on changes)
                 --plugin PATH                  Load extension
                 --quiet, -q                    Suppress banners
Info             --help, -h                     Show help
                 --version, -V                  Show version
                 --list-plugins                 List loaded plugins