Where Sift Fits
The idea of querying text files with SQL is not new. A small ecosystem of tools has grown up around this premise, each carving out its own niche. Some target data analysts who live in CSVs and JSON exports. Others aim at system administrators parsing logs. When I released Sift, several people asked the obvious question: why build another one? The answer lies in a gap I kept bumping into—a missing link between Unix text processors and database engines that becomes visible only when you try to treat a codebase as queryable data.
The Structured Data Camp
The most mature SQL-on-text tools assume your data has columns. This is a reasonable assumption if you're a data scientist working with exports, but it falls apart the moment you point them at a source file or a config file.
q is perhaps the most widely adopted. It parses delimited files—CSVs, TSVs, anything with a consistent separator—and lets you query them with familiar SQL syntax. For its intended use case, it works beautifully:
q -H -d, "SELECT product, SUM(revenue) FROM sales.csv GROUP BY product"
The -H flag tells it the first row is a header; -d, sets the delimiter to a comma. But try to use it on a Python file and you're stuck: there's no delimiter to speak of, no columns to select. The tool simply isn't designed for that.
TextQL occupies similar territory. It wraps SQLite, ingesting structured files into temporary tables. dsq extends this pattern to JSON, Parquet, and Excel files. These are excellent tools for what they do, but "what they do" is data analysis, not code exploration.
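For flavor, the invocations look roughly like this (from memory, so check each tool's README for the exact flags):
textql -header -sql "SELECT product, SUM(revenue) FROM sales GROUP BY product" sales.csv
dsq sales.json "SELECT product, SUM(revenue) FROM {} GROUP BY product"
Both ingest the input into SQLite behind the scenes; dsq's {} is a placeholder for the table created from the file.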
At the heavyweight end sits DuckDB, an embeddable analytical database that runs as a CLI binary. It can query enormous CSV and Parquet files with remarkable speed, supporting window functions, CTEs, and the full arsenal of modern SQL. For data work, it's extraordinary. For finding where a function is called in a codebase, it feels like deploying a fighter jet to fetch groceries.
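A taste of the CLI (recent DuckDB releases accept a query with -c and can read a CSV or Parquet file by path):
duckdb -c "SELECT product, SUM(revenue) AS total FROM 'sales.parquet' GROUP BY product ORDER BY total DESC"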
The Log Watchers
System administrators face a different problem: logs are text, but they're not tabular. Each line is a discrete event, often with embedded structure (timestamps, log levels, JSON payloads) but no global schema. The tools built for this domain come closest to what Sift attempts.
lnav (the Log File Navigator) is the most sophisticated. It parses common log formats automatically, building virtual tables that you can query with SQL:
:filter-in error
;SELECT log_time, log_body FROM syslog_log WHERE log_level = 'error'
The colon prefix issues interactive commands; the semicolon prefix enters SQL mode. The tool understands syslog, Apache logs, and dozens of other formats out of the box. It even provides a curses interface for interactive exploration.
But lnav is an observer, not an editor. It's designed to help you understand what happened in a system, not to change files. There's no equivalent to Sift's --refine or --pick. You can't use it to fix the malformed log entries you've identified.
The Holy Trinity
Before any of these tools existed, Unix provided its own answer: grep for searching, sed for editing, awk for transforming. This trio has served developers for five decades, and for simple tasks, they remain unmatched in terseness:
grep -n "TODO" *.py # Find all TODOs
sed -i 's/foo/bar/g' config.yaml # Replace foo with bar
awk '{sum += $2} END {print sum}' # Sum the second column
The friction appears when tasks grow complex. Suppose you want to find all Python files where a function is defined but never called. Or replace a variable name, but only in lines that match a certain pattern, and only in files modified in the last week. Suddenly you're chaining four tools together, debugging quoting issues, and wondering if there's a simpler way. I've been there more times than I can count.
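To make that concrete, here is roughly how the "defined but never called" check goes with the classic tools. This is a sketch that assumes GNU grep with PCRE support, and it already ignores methods, decorators, and anything dynamic:
# List every function name, then look for a call site that isn't the definition itself
grep -rhoP 'def \K\w+' --include='*.py' . | sort -u | while read -r fn; do
  grep -rqP --include='*.py' "(?<!def )\b${fn}\s*\(" . || echo "$fn: defined but apparently never called"
done
Two greps, a sort, a shell loop, and a lookbehind, and it still misses plenty of cases.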
There's a reason "write a Python script" becomes the default answer for anything beyond trivial text manipulation. The cognitive overhead of combining grep, sed, and awk correctly exceeds the overhead of just writing procedural code. That realization is what pushed me to build something different.
The Gap
If you plot these tools on two axes—structured versus unstructured input, read-only versus read-write—you'll notice a quadrant that's mostly empty:
| Tool | Primary Use | Raw Lines? | Edits Files? |
|---|---|---|---|
| Sift | Codebases, raw text | Yes | Yes |
| q / TextQL | CSV, tabular data | No | No |
| DuckDB | Analytics at scale | No | No |
| lnav | Log analysis | Yes | No |
| ripgrep | Fast search | Yes | No |
| sed | Stream editing | Yes | Yes |
The intersection of "works on raw lines" and "can edit files" contains only sed—and sed, for all its power, doesn't speak SQL. It speaks its own cryptic language of addresses and commands, a language that most developers learn just enough of to be dangerous.
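One example makes the point. In sed, addresses pick lines and one-letter commands act on them, so a script like
sed '10,20{/debug/d;}' app.log
reads as "between lines 10 and 20, delete any line containing debug". Compact, yes, but not something you arrive at by intuition.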
This is the gap I designed Sift to fill. It treats every line of every file as a row in a table, exposes that table to standard SQL, and can write the results back. The mental model I wanted was simple: your codebase is a database; queries return data; some queries also change it.
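In SQL terms, using the same lines and files tables that appear in the comparison below, "show me every TODO with its location" is nothing more than:
SELECT f.filepath, l.line_number, l.content
FROM lines l
JOIN files f USING (file_id)
WHERE l.content LIKE '%TODO%'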
A Practical Comparison
Let me walk through a concrete task: finding all JavaScript files that import a deprecated module, then adding a comment flagging them for migration. Here's how you might approach it with different tools.
With grep and sed:
# Find the files
grep -rl "from 'old-module'" --include="*.js" src/
# Add a comment (but only after the import line)
# This is where it gets complicated...
find src -name "*.js" -exec grep -l "from 'old-module'" {} \; | \
xargs -I{} sed -i "/from 'old-module'/a // TODO: Migrate away from old-module" {}
The sed incantation works, but you need to remember that /pattern/a appends after matching lines, that -i edits in place, and that the whole thing assumes GNU sed (BSD sed on macOS behaves differently).
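For the record, the BSD flavor wants an explicit (possibly empty) backup suffix after -i and a literal newline after a\. Something like this should work on macOS, though I haven't tested every variant, and the quoting is half the battle:
find src -name "*.js" -exec grep -l "from 'old-module'" {} \; | \
  xargs -I{} sed -i '' "/from 'old-module'/a\\
// TODO: Migrate away from old-module" {}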
With Sift:
find src -name "*.js" | sift --sweep --for "
  SELECT f.filepath, l.line_number,
         l.content || char(10) || '// TODO: Migrate away from old-module' AS content
  FROM lines l
  JOIN files f USING (file_id)
  WHERE l.content LIKE '%from ''old-module''%'
"
The SQL is longer, but it's also readable. You can see exactly what's happening: join lines to files, filter for the import statement, append the comment. The char(10) is a newline. Preview it with --shake first, see the diff with --diff, then run it for real.
With a Python script:
import os

for root, dirs, files in os.walk('src'):
    for f in files:
        if f.endswith('.js'):
            path = os.path.join(root, f)
            with open(path, 'r') as file:
                lines = file.readlines()
            modified = False
            new_lines = []
            for line in lines:
                new_lines.append(line)
                # Append a migration note right after each matching import line
                if "from 'old-module'" in line:
                    new_lines.append('// TODO: Migrate away from old-module\n')
                    modified = True
            if modified:
                with open(path, 'w') as file:
                    file.writelines(new_lines)
This is clear and correct, but it's twenty lines of code for a one-off task. You'll write it, run it, and never use it again. I wanted Sift to make the one-liner viable for cases that currently push you toward scripts.
When to Use What
No tool is universal. Here's how I think about choosing:
Use q or TextQL when you have clean CSVs and want quick aggregations. They're fast to invoke and require no setup. If your data has headers and delimiters, these tools will feel natural.
Use DuckDB when you're doing serious data analysis—joins across multiple large files, window functions, analytical queries that would choke lighter tools. It's also the right choice when you need to query Parquet or other columnar formats.
Use lnav when you're investigating incidents in log files. Its format detection and interactive interface are purpose-built for that workflow. The SQL capability is a bonus on top of excellent log parsing.
Use ripgrep when you just need to find something fast. Nothing beats rg pattern for speed. If you don't need to transform or aggregate, don't reach for SQL.
Use Sift when you need to query or transform unstructured text with more logic than grep allows, especially if you want to edit files based on what you find. It's the right tool when you catch yourself chaining three utilities together and wishing for a single coherent language.
The Underlying Philosophy
Each of these tools embodies a philosophy. The structured-data tools believe your text should be organized before you query it. The Unix tools believe small, composable programs are always better than large, integrated ones. lnav believes logs deserve special treatment.
When I built Sift, I started from different assumptions: that developers already know SQL, that raw text is a valid thing to query, and that the gap between "find" and "fix" should be smaller. I don't see it as a replacement for any of the tools above—it's a complement, filling a niche they don't address.
Whether that niche matters to you depends on your work. If you spend your days in Jupyter notebooks processing clean datasets, you may never need it. If you maintain a large codebase and regularly find yourself writing throwaway scripts to refactor code or clean up data files, Sift might become indispensable. It has for me.
The source is on GitHub. I'd suggest trying it on a real task—not a toy example, but something you'd otherwise solve with grep-sed-awk or a Python script. That's the only honest way to evaluate whether it earns a place in your toolkit.