Even private instances of the largest

Posted: Tue Feb 11, 2025 5:10 am
by asimd23
The data itself does not have to be sent to an expensive, slow, remote language model. Semantic matching (known as "embedding" among those who train GenAI foundation models) offers an inexpensive, fast approach for high-throughput data applications. Semantic matching works especially well for data because so much context (database, schema, column type, etc.) is known about each element. The number 405 in a database is indistinguishable from other valid numeric values, but if the table is "payroll" and the column is "num_dependents", a language model can flag the oddity.
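Here is a minimal sketch of that idea: serialize each value together with its table and column context, embed it with a small local model, and flag values whose embeddings sit far from known-typical examples. The sentence-transformers model name, the example strings, and the similarity threshold are all illustrative assumptions; whether a given embedding model actually separates 405 from 2 in this context is something to verify on real data.

```python
# Sketch: context-aware semantic matching with a small local embedding model.
# Model name, sample values, and threshold are assumptions, not a benchmark.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # runs locally, no data leaves

def embed(texts):
    # Encode and L2-normalize so dot products are cosine similarities.
    vecs = model.encode(texts)
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# The context string is what turns a bare number into something judgeable.
typical = embed([
    "payroll.num_dependents = 0",
    "payroll.num_dependents = 2",
    "payroll.num_dependents = 4",
])

candidate = embed(["payroll.num_dependents = 405"])[0]

# Low similarity to every known-typical example -> flag for review.
similarity = float(np.max(typical @ candidate))
if similarity < 0.9:  # illustrative threshold; tune on real data
    print(f"flag: unusual value (similarity={similarity:.2f})")
```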

But of course, if a language model is involved in any way, there will be concerns about where and how data is being sent and used. The good news is that there are now several proven options for air-gapped instances of very competent models (Mistral, Llama, etc.) that can be deployed to toil away in a secure location, providing value but never revealing their secrets. Even private instances of the largest models, like GPT-4V, can be provisioned as securely as the cloud databases that ETL targets.
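For concreteness, here is a sketch of what calling such an air-gapped model might look like. It assumes an Ollama-style HTTP server hosting a Mistral model on localhost; the endpoint, port, model name, and the classify_row helper are all assumptions for illustration, not a prescribed deployment.

```python
# Sketch: query a locally hosted model so row data never leaves the network.
# Assumes an Ollama-style server on localhost; endpoint and model name are
# assumptions for illustration.
import requests

def classify_row(table: str, column: str, value) -> str:
    prompt = (
        f"Table '{table}', column '{column}' contains the value {value!r}. "
        "Answer 'plausible' or 'suspicious' with a one-line reason."
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",  # local, air-gapped endpoint
        json={"model": "mistral", "prompt": prompt, "stream": False},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(classify_row("payroll", "num_dependents", 405))
```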

A ripe area for improvement in ETL is aggregation. Until now we have had coarse choices: either stream the detailed data into a warehouse or apply simplistic aggregation functions (sum, average, max, etc.) to grouped batches before storing. Language models allow us to treat event detection as a form of aggregation. Monitoring satellite images as they stream into a database, we could instruct the model to save only images containing cyclone storms. In a security feed, we could save human conversations or loud noises and drop the rest. We could record an event whenever a time-series signal becomes volatile, or keep the five minutes before and after a large sudden decline (see the sketch below). These are all examples of event-based aggregation that a model can implement by generating verifiable signal/data-processing code on the fly.
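The "five minutes before and after a decline" rule is exactly the kind of verifiable code a model could generate. A minimal hand-written version of it is sketched below; the column name, decline threshold, and window size are assumptions chosen for illustration.

```python
# Sketch: event-based aggregation that keeps only the data around large
# sudden declines and drops everything else. Column name, threshold, and
# window size are assumptions.
import pandas as pd

def keep_around_declines(df: pd.DataFrame, drop_pct: float = 0.05,
                         window: pd.Timedelta = pd.Timedelta(minutes=5)) -> pd.DataFrame:
    """df has a DatetimeIndex and a 'value' column sampled at regular intervals."""
    pct_change = df["value"].pct_change()
    events = df.index[pct_change <= -drop_pct]  # timestamps of sudden declines

    # Mark every row within +/- `window` of any event; drop the rest.
    keep = pd.Series(False, index=df.index)
    for t in events:
        keep |= (df.index >= t - window) & (df.index <= t + window)
    return df[keep]
```

Because the filter is ordinary, inspectable code rather than an opaque model judgment, it can be reviewed and tested before being trusted in the pipeline, which is what makes this style of aggregation "verifiable."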