jeadie's comments

jeadie | 9 months ago | on: Airport for DuckDB

This is one of the ideas behind using DuckDB in github.com/spiceai/spiceai

jeadie | 1 year ago | on: Pinecone integrates AI inferencing with vector database

This is a common feature now. If anything, for being so early to vector databases, Pinecone was rather late to integrating embeddings.

Timescale most recently added it but, yes a bunch of others: Weaviate, Spice AI, Marqo, etc.

jeadie | 1 year ago | on: Ask HN: Who is hiring? (April 2024)

Spice AI | Senior Software Engineer | GMT+10 (e.g. Australia) through GMT-7 (e.g. Seattle/SF/LA) | Remote | Full Time

Spice AI provides building blocks for data- and AI-driven applications by composing real-time and historical time-series data, high-performance SQL query, and machine learning training and inference into a single, interconnected AI backend-as-a-service.

We just launched github.com/spiceai/spiceai, a unified SQL query interface and portable runtime to locally materialize, accelerate, and query data tables sourced from any database, data warehouse, or data lake.

We're hiring experienced software engineers, ideally with Rust and/or Golang production experience. We're focused on large-scale data and distributed systems, so experience in these areas is important too. More details: https://spice.ai/careers#section-open-positions

jeadie | 2 years ago | on: GGML – AI at the Edge

I'm very glad that this has some added funding. I am building a serverless API on the Cloudflare edge network using GGML as the backbone --> tryinfima.com

jeadie | 2 years ago | on: PrivateGPT

I've tried both Chroma and Qdrant. I don't think Chroma lacks that much. It's definitely newer, but it's also a great product. I think cloud support is coming Q3 2023.

jeadie | 2 years ago | on: After All Is Said and Indexed – Unlocking Information in Recorded Speech

Most people who end up needing vector DBs, like me, want to use LLMs on a specific, often private, dataset or use case. Typically you start with something like unstructured JSON data, then need to pick and manage LLMs to create embeddings, then store both the embeddings and the original JSON data in a vector DB. The application is then some variety of CRUD operations plus search over both the original data and the embeddings.

Chroma, Pinecone, and I guess FAISS/HNSWlib/etc. only handle the vector operations. What I'd really want, and what Marqo does, is to handle everything end to end.
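The pipeline described above can be sketched in a few lines. This is a minimal, illustrative sketch only: a deterministic hash-based "embedding" stands in for a real LLM, a plain dict stands in for the vector DB, and all names (`ToyVectorStore`, `upsert`, `search`) are hypothetical rather than any particular product's API.

```python
# Toy end-to-end sketch: JSON docs -> embeddings -> store -> CRUD + search.
# The hash "embedding" is a stand-in for a real LLM embedding model.
import hashlib
import math


def embed(text: str, dim: int = 8) -> list[float]:
    """Toy deterministic embedding: hash the text into a unit vector."""
    digest = hashlib.sha256(text.encode()).digest()
    vec = [b / 255.0 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


class ToyVectorStore:
    """Keeps the original JSON document alongside its embedding, so CRUD
    and semantic search both work against a single record."""

    def __init__(self):
        self.records = {}  # id -> (document, vector)

    def upsert(self, doc_id: str, document: dict, text_field: str):
        self.records[doc_id] = (document, embed(document[text_field]))

    def delete(self, doc_id: str):
        self.records.pop(doc_id, None)

    def search(self, query: str, k: int = 3):
        qv = embed(query)
        scored = [(cosine(qv, vec), doc_id, doc)
                  for doc_id, (doc, vec) in self.records.items()]
        scored.sort(key=lambda t: t[0], reverse=True)
        return [(doc_id, doc) for _, doc_id, doc in scored[:k]]


store = ToyVectorStore()
store.upsert("a1", {"title": "refunds", "body": "refunds within 30 days"}, "body")
store.upsert("a2", {"title": "shipping", "body": "ships worldwide in 5 days"}, "body")
results = store.search("refunds within 30 days", k=1)
```

A real setup would swap `embed` for an actual model and `ToyVectorStore` for Chroma, Qdrant, etc., but the shape of the problem (original data and vectors living side by side) stays the same.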

jeadie | 2 years ago | on: Do you need a vector database?

This is generally very context- and use-case-specific. In general, if a document is a `Dict[str, Any]`, then you either have one (or multiple) vector(s) per field, or you have to combine vectors across fields (and it's not self-evident how you'd best do that). That said, there are specific reasons to do this (and why I've done it in the past):

1. Chunking long text fields in documents so as to get a better semantic vector for them (you can also only fit so much into an LLM's context).

2. Separately from 1., chunking long text fields (or even chunking images, audio, etc.) is one way to perform highlighting. It helps answer the question: for a given document, what about it was the reason it was returned? You can then point to the area in the image/text/audio that was most relevant.

3. You may want to run different LLMs on different fields (perhaps a separate multi-modal LLM vs. a standard text LLM), or, like another comment said, have different transforms/representations of the same field.

Perhaps 100 vectors is non-standard, but definitely not unseen.
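Points 1 and 2 above can be sketched together: split a long field into chunks, give each chunk its own vector, and reuse the best-scoring chunk as the highlight. As before, this is a toy sketch; the deterministic hash "embedding" stands in for a real LLM, and the function names are illustrative.

```python
# Chunk a long text field into multiple vectors; the best-matching chunk
# doubles as the "highlight" explaining why the document was returned.
import hashlib
import math


def embed(text: str, dim: int = 8) -> list[float]:
    """Toy deterministic embedding; a real system would call an LLM here."""
    digest = hashlib.sha256(text.encode()).digest()
    vec = [b / 255.0 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def chunk_text(text: str, size: int = 40) -> list[str]:
    """Split a long field into word-aligned chunks of roughly `size` chars."""
    words, chunks, cur = text.split(), [], []
    for w in words:
        cur.append(w)
        if len(" ".join(cur)) >= size:
            chunks.append(" ".join(cur))
            cur = []
    if cur:
        chunks.append(" ".join(cur))
    return chunks


def highlight(query: str, text: str) -> str:
    """Return the chunk most responsible for the document matching the query."""
    qv = embed(query)
    return max(chunk_text(text), key=lambda c: cosine(qv, embed(c)))
```

Each chunk's vector would be stored separately in the vector DB (hence many vectors per document), with a pointer back to the parent document and the chunk's offset for highlighting.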
