spql

like sed, awk, and jq, but for LLMs.

Features | Usage | Examples | Database Extensions | Cookbook | FAQ


echo '"https://www.reddit.com/r/python/top.json"' |
spql '
    # Issue a GET request
    http_get |
    # Process the json response with standard jq
    .data.children[] | 
    {
        "title": .data.title, 
        "author": .data.author,
        "title_tokens": .data.title | tokenize("hf-internal-testing/tiny-random-gpt2") ,
        "title_prompt": (.data.title | prompt("llama2c")),
        "title_embedding_4": (.data.title | embed("hf-internal-testing/tiny-random-gpt2") | .[0:4]),
        "title_embedding_ndims": (.data.title | embed("hf-internal-testing/tiny-random-gpt2") | ndims)
    }
'


spql is a command-line tool, a programming language, and a database extension for working with LLMs.

spql’s goal is to allow the creation of composable pipelines of vectors in a minimal but powerful way. It is equally usable and flexible across different execution contexts.

It borrows from the Unix philosophy, but with a twist: it treats the vector, not text, as the universal interface. In an LLM world, however, text and vectors are not that different, are they?

The pattern is clear: sed, awk, grep, and friends worked for text. jq worked for JSON data. spql works on vectors.

In the snippet above, notice how spql glues things together:

  • Everything is executed in the shell.

  • The http_get function runs within the program itself (powered by curl).

  • The tokenizer and the embedding model used are coded in Python and provided by Hugging Face.

  • The llama2c model is coded in C and is local.

  • Input and output are JSON.

🚀 Getting Started#

Installation#

In the current alpha version, the easiest way to try out spql is to alias bash functions to docker containers.

Running these in your shell will provide you with a spql bash function and a sqlite3 instance pre-bundled with spql.

function spql() {
  docker run -v "$HOME/.cache/huggingface:/.cache/huggingface" -i florents/spql:v0.1.0a1 "$@"
}
function sqlite3() {
  docker run -i --entrypoint /usr/bin/sqlite3 florents/spql:v0.1.0a1 "$@"
}

This will allow you to replicate the examples below.
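To verify the setup, this should print the number of dimensions of a three-element vector (assuming the image was pulled successfully):

echo '[1,2,3]' | spql 'vector | ndims'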

Usage#

Here’s how spql queries can be executed in different contexts. (Spoiler alert: they look exactly the same everywhere.)

echo '[1,2,3]' | spql 'vector | ndims'
echo '"Hello world"' | spql 'prompt'
echo '"Hello world"' | spql 'embed("hf-internal-testing/tiny-random-gpt2")' 
sqlite3 <<SQL
.load spqlite

select spql('[1,2,3]', 'vector | ndims');
select spql('"Hello world"', 'prompt');
select spql('"Hello world"', 'embed("hf-internal-testing/tiny-random-gpt2")');

SQL

In Postgres (extension coming soon):

psql <<SQL

create extension spql;
select spql('[1,2,3]'::jsonb, 'vector | ndims');
select spql('"Hello world"'::jsonb, 'prompt');
select spql('"Hello world"'::jsonb, 'embed("hf-internal-testing/tiny-random-gpt2")');

SQL

In DuckDB (extension coming soon):

duckdb <<SQL
INSTALL spql;
LOAD spql;

select spql('[1,2,3]', 'vector | ndims');
select spql('"Hello world"', 'prompt');
select spql('"Hello world"', 'embed("hf-internal-testing/tiny-random-gpt2")');

SQL

Features#

spql is equally usable as a CLI tool, a programming language, and a database extension.

Here are the features:

  • Standard library of vector operations: L2 distance, inner product, and cosine distance

  • Interact with LLMs to generate, embed, and tokenize content.

  • Support for both local and remote LLMs.

  • Support for sentence-transformers and Hugging Face transformers.

  • Support for llamafile models.

  • Out-of-the-box extensions for Postgres, SQLite, and DuckDB.

  • 100% jq-compliant. Your existing jq programs continue to work, and your jq-fu can still make you a code-golf superstar (see the snippet below).
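Since spql is a superset of jq, any plain jq program runs unchanged. A minimal sanity check, using only standard jq arithmetic and no spql-specific functions:

echo '{"a": 1}' | spql '.a + 1'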

Examples#

spql simply extends jq by adding some custom types and functions.

This means that standard jq syntax and functionality is 100% available. Thus, have a look at jq’s manual.

Below are some examples that showcase a few of the available types and functions; they are mostly here to whet your appetite.

See the Reference for details.

Prompts#

spql expects json as input.

The most basic thing you can do is:

echo '"Hello World"' | spql 'prompt'

Multiple Prompts#

To pass multiple prompts in one go, pipe a json array instead and use map(generate):

echo '["Hello", "Hi", "Howdy", "¡Hola!", "Γειά!"]' | 
spql 'map(generate)'

Or like this:

echo '["Hello", "Hi", "Howdy", "¡Hola!", "Γειά!"]' | 
spql ' .[] | { "prompt": ., "response": prompt}'

Choosing a Model#

To specify a model, use prompt("model_name").

By default, spql uses the llama2c model for demonstration purposes, so a bare prompt is equivalent to prompt("llama2c").
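For example, spelling the default out explicitly (this should behave the same as the bare prompt above):

echo '"Hello World"' | spql 'prompt("llama2c")'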

Llamafile Models#

spql natively supports llamafile. It does so via llamafile’s web API.

Download an example llamafile and run it in server mode:

wget "https://huggingface.co/jartine/phi-2-llamafile/resolve/main/phi-2.Q2_K.llamafile?download=true" -O /tmp/phi-2.Q2_K.llamafile
chmod +x /tmp/phi-2.Q2_K.llamafile
/tmp/phi-2.Q2_K.llamafile --nobrowser --port 8080

Now in another terminal process you can:

echo '"Are you a LLama?"' | 
spql 'prompt("http://localhost:8080")'

NOTE: llamafile support is not currently available in the docker image

gguf GPT4All Models#

spql supports gguf models via GPT4All.

Simply pick one of the available models and pass it as an argument.

echo '"Tell me a story about flying spaghetti"' | 
spql 'prompt("orca-mini-3b-gguf2-q4_0.gguf")'

NOTE: gguf GPT4All model support is not currently available in the docker image

Embeddings#

To retrieve embedding vectors from text input you can use the embed function:

echo '"You embed me!"' | spql 'embed'

Similarly to prompt, this uses the default embedding model.

Hugging Face Transformers#

You can of course choose any embedding model from Hugging Face Transformers:

echo '"You embed me!"' | spql 'embed("hf-internal-testing/tiny-random-gpt2")'

Tokenizers#

Similarly to embeddings, you can use tokenizers. The resulting array will, of course, contain integers instead of floats.

echo '"You tokenize me!"' | spql 'tokenize("hf-internal-testing/tiny-random-gpt2")'

Vectors#

spql supports operations on vectors as well. You can treat vectors as json arrays of numbers and vice versa.

Here’s an example getting the l2_norm of an embedding vector:

echo '"You embed me!"' | 
spql 'embed("hf-internal-testing/tiny-random-gpt2") | l2_norm'

Below are some more examples of vector operations:

spql -n '[1,2,3,4] | l2_norm'
spql -n '[1,2,3,4] | l2_distance([11,22,33,43])'
spql -n '[1,2,3,4] | dot_product([1,3,5,40])'
spql -n '[1,2,3,4] | cosine_similarity([1,3,5,40])'

Database Extensions#

You can run spql programs in your SQL queries! This is achieved through database extensions. These extensions implement a spql SQL function you can generally call like this:

SELECT spql(json, '...spql..');

The first argument is a json object and the second is a spql program like the ones in the examples above. It is of course convenient that SQL uses single quotes (') for quoting while json uses double quotes ("). This means you can pass even complex spql programs verbatim and let the database handle the complexity.
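For instance, a program full of double quotes embeds verbatim inside the single-quoted SQL literal. A minimal illustration (the input object is hypothetical; the program uses only jq object syntax and the standard length function):

SELECT spql('{"name": "Ada"}', '{"greeting": .name, "name_length": (.name | length)}');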

Let’s generate alternative openings for some classic books:

SQLite#

sqlite3 <<SQL
.load spqlite

CREATE TABLE books
(
    author  TEXT,
    year    INTEGER,
    title   TEXT,
    opening TEXT
);

INSERT INTO books (author, year, title, opening)
VALUES ('"Jane Austen"', 1813, 'Pride and Prejudice',
        '"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife."'),
       ('"Charles Dickens"', 1859, 'A Tale of Two Cities',
        '"It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness..."'),
       ('"Herman Melville"', 1851, 'Moby-Dick', '"Call me Ishmael."'),
       ('"Leo Tolstoy"', 1869, 'War and Peace',
        '"Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes."'),
       ('"Mary Shelley"', 1818, 'Frankenstein',
        '"You will rejoice to hear that no disaster has accompanied the commencement of an enterprise which you have regarded with such evil forebodings."');

SELECT title, spql(opening, 'prompt')
FROM books;
SQL

Postgres#

The same code as above works on Postgres as well. The only thing that changes is the way the extension is installed.

psql <<SQL
CREATE EXTENSION spql;
CREATE TABLE books
(
    author  TEXT,
    year    INTEGER,
    title   TEXT,
    opening TEXT
);

INSERT INTO books (author, year, title, opening)
VALUES ('"Jane Austen"', 1813, 'Pride and Prejudice',
        '"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife."'),
       ('"Charles Dickens"', 1859, 'A Tale of Two Cities',
        '"It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness..."'),
       ('"Herman Melville"', 1851, 'Moby-Dick', '"Call me Ishmael."'),
       ('"Leo Tolstoy"', 1869, 'War and Peace',
        '"Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes."'),
       ('"Mary Shelley"', 1818, 'Frankenstein',
        '"You will rejoice to hear that no disaster has accompanied the commencement of an enterprise which you have regarded with such evil forebodings."');

SELECT title, spql(opening::jsonb, 'prompt')
FROM books;

DROP TABLE books;
SQL

DuckDB#

NOTE: coming soon.

Cookbook#

Model Aliases#

To save yourself some typing, you can pass model names (and indeed any other variables) to spql programs using the --arg flag:

echo '"John the Baptist"' | 
spql --arg m "hf-internal-testing/tiny-random-gpt2" '
  embed($m)
'

Embed Json Recursively#

Here’s an example that walks a json document recursively and replaces every string value with its embedding vector. It does so using the standard jq walk(f) function.

echo '{
  "books": [
    {
      "title": "The Hobbit",
      "author": "J.R.R. Tolkien",
      "published": 1937
    },
    {
      "title": "Foundation",
      "author": "Isaac Asimov",
      "published": 1951
    }
  ],
  "library": "City Library"
}
' | 
spql 'walk(if type == "string" then . = embed("hf-internal-testing/tiny-random-gpt2") else . end)'

Model Pipelines#

Now this is cool: You can pipe different models together in a Unix-like fashion.

echo '"Tell me a story"' |
spql 'prompt("llama2c")' |
spql -c 'embed("hf-internal-testing/tiny-random-gpt2")'

Complex Programs#

Here’s a more complex program: it fetches a json document over HTTP and then, for each element in the children array, tokenizes, prompts, and embeds its title:

echo '"https://www.reddit.com/r/python/top.json"' |
spql '
    # Issue a GET request
    http_get |
    # Process the json response with standard jq
    .data.children[] | 
    {
        "title": .data.title, 
        "author": .data.author,
        "title_tokens": .data.title | tokenize ,
        "title_prompt": (.data.title | prompt("llama2c")),
        "title_embedding_4": (.data.title | embed | .[0:4]),
        "title_embedding_ndims": (.data.title | embed("hf-internal-testing/tiny-random-gpt2") | ndims)
    }
'

Programs from Files#

For complex programs, you can use the -f src.jq argument to execute programs stored in files.

cat << EOF > /tmp/myprogram.jq
. | length
EOF
echo '[1,2,3]' | spql -f /tmp/myprogram.jq

Passing --arguments#

Here’s an example shell script that iterates over the lines of a csv file and generates an embedding for one of its columns. Notice how bash variables can be passed to the program using --arg.

cat <<EOF > /tmp/spql.csv
id,name,title
1,John Doe,Software Engineer
2,Jane Doe,Data Scientist
3,James Doe,Product Manager
EOF
# Skip the header line using tail -n +2
tail -n +2 "/tmp/spql.csv" | while IFS=, read -r id name title
do
  spql --arg id "$id" --arg name "$name" --arg title "$title" -n '{
    id: $id,
    name: $name,
    title: $title,
    title_embedding: ($title | embed)
  }'
done

Issue a GET request, extract a piece of json with standard jq, embed it, and compute its l2_norm:

echo '"https://www.reddit.com/r/llm/top.json"' |
spql '
    http_get |
    .data.children[0].data.title |
    embed("hf-internal-testing/tiny-random-gpt2") |
    l2_norm
'
And here’s one that embeds pairs of words in a single pass:

echo '
[
  ["King", "Queen"],
  ["Table", "Tableau"],
  ["Dog", "Hot"]
]
' | 
spql 'map([ (.[0] | embed) , (.[1] | embed) ])'

FAQ#

What’s the motivation?#

There are already lots of LLMs from different providers and many tools. spql does not aim to replace these tools; it just glues them together.

The future (probably) belongs to LLMs targeting specific vertical domains, and we’ll need tools to glue these things together.

People will probably glue different models together, in different contexts and with different fine-tuning: imagine a scenario that uses a local-model-1 for a subset of on-prem data, a remote-gpt-2 for embedding customer reviews, and a local-tokenizer-3 for tokenizing product descriptions.

In other words, we need something like SQL, but instead of joining tables, we need to join models.

spql aims to allow exactly that: The declarative specification of pipelines of different models, like SQL does for tables.

What’s the current status?#

Currently, spql is in a closed alpha and is being tried out by people in different organizations for initial feedback. Depending on that feedback, it will be tweaked, and it will be open-sourced sooner or later.

What’s next on the roadmap?#

As a developer utility, spql can already save you many keystrokes and tool installations.

Here are some thoughts about the steps beyond that:

  • A simple web app like https://jqplay.org (in progress)

  • Better support and housekeeping for remote APIs (handling API keys, rate limiting, etc.)

  • Export integration (e.g. use spql to glue models together and store the resulting vectors in vector databases)

  • Smarter execution plans: currently everything is executed sequentially. There’s no reason, though, to wait for a remote LLM to respond when the local one has already finished.

  • Use caching to store results.

  • More vector operations: knn-search, HNSW, IVFFlat

  • Delayed execution: to pause/resume partially executed queries.

  • Tighter database integration: within a database environment, spql can be much smarter than it already is, without changing the user interface.

  • Keep track of costs: no one cares about this yet, but everyone soon will. We’ve had cost-based query execution for databases; we need the equivalent for LLM operations.

  • Better packaging and distribution (more Docker images, deb and rpm packages, Windows support).

Do I need to learn another language?#

Not exactly. Unless you’ve been coding in assembly only, chances are you’ve already used jq a few times. If you have, you can use spql as well. And if you do, you’ll probably use the three user-facing functions: prompt, embed, and tokenize.
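As a sketch, here are all three side by side (reusing the demo models from the examples above):

echo '"Hello"' | spql '{
  "response": prompt,
  "embedding_head": (embed("hf-internal-testing/tiny-random-gpt2") | .[0:4]),
  "tokens": tokenize("hf-internal-testing/tiny-random-gpt2")
}'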

Isn’t jq complicated, though?#

It’s true that many jq programs out there can seem fancy and show-offy, but it doesn’t have to be that way. The perceived complexity stems from the fact that jq has a rather small standard library and that people hesitate to write multi-line jq programs.
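Nothing stops you from formatting jq readably, though. For example, a multi-line, commented program (standard jq, so it runs under spql unchanged):

echo '[1,2,3,4]' | spql '
  # double each element, then sum the results
  map(. * 2)
  | add
'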

Only json?#

spql is based on jq, so it consumes and returns json. That said, I’ve been thinking about other input formats, like Arrow.

How is this implemented?#

It’s written in C, just like jq.

You can think of spql as an extension to jq.

It uses jq’s compiler, syntax, and standard library, but it adds custom types and functions suitable for LLMs. It also embeds Python natively, to tap into its vast machine-learning ecosystem.

Is there a formal spec of the language?#

jq is not formally specified, but there is a recent effort toward a formal spec.

Is there a backstory?#

Actually, yes: I’ve built pgJQ (a jq extension for Postgres) and liteJQ (a jq extension for SQLite).

spql was inspired by these two tools.

Acknowledgments#

There’s already a lot of wrapping going on in the broader AI/LLM community, and many projects aren’t clear or honest enough about that. I’d like to be brutally honest:

spql just glues things together: a lot of things

Here’s a list of thanks:

  • Thanks to the jq community and especially to the contributors who breathed fresh air into the project.

  • Thanks to llama.cpp and ggerganov

  • Thanks to Nomic and GPT4All

  • Thanks to jart & the Mozilla team for their effort on llamafile

  • Thanks to simonw for cutting through a lot of the noise and for building llm.