sureglymop | 1 month ago

I'm working on a project in this realm. Basically, I am building a "personal" spyware/data collection software suite. It is in the same vein as Microsoft Recall, but more focused on security and privacy, with sensible cryptographic defaults where needed.

It is basically a server/data store plus client agents; currently the agent only exists for Linux end-user devices. The agent records evdev events (keystrokes/mouse movement), the currently active window, clipboard history, shell commands issued, and browsing behaviour. It runs as its own user, and each piece of functionality is compartmentalized into its own process. Data is encrypted at rest. I'm still looking into how best to handle sensitive data in memory at runtime.

It stores these events in a persistent queue on the clients and one-way syncs them to the server. If a client is offline for a bit, it syncs them when it comes back online. Since events can pile up on the client in the meantime, I am also trying to minimize the storage used.
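
A minimal sketch of such a queue with one-way sync, assuming SQLite as the on-disk store (the post does not say what the project actually uses). The names `EventQueue`, `sync_once`, and the `send` callback are hypothetical and only illustrate the "delete after the server acknowledges" idea:

```python
import sqlite3


class EventQueue:
    """Hypothetical durable client-side queue; SQLite is an assumption."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "ts_ns INTEGER NOT NULL, "
            "payload BLOB NOT NULL)"
        )
        self.db.commit()

    def push(self, ts_ns, payload):
        # The connection context manager commits on success, rolls back on error.
        with self.db:
            self.db.execute(
                "INSERT INTO events (ts_ns, payload) VALUES (?, ?)",
                (ts_ns, payload),
            )

    def peek_batch(self, limit=100):
        # Read the oldest events without removing them.
        return self.db.execute(
            "SELECT id, ts_ns, payload FROM events ORDER BY id LIMIT ?",
            (limit,),
        ).fetchall()

    def ack(self, last_id):
        # Drop everything up to and including the last acknowledged event.
        with self.db:
            self.db.execute("DELETE FROM events WHERE id <= ?", (last_id,))

    def __len__(self):
        return self.db.execute("SELECT COUNT(*) FROM events").fetchone()[0]


def sync_once(queue, send, limit=100):
    """Send one batch; delete it locally only if `send` reports success."""
    batch = queue.peek_batch(limit)
    if not batch:
        return 0
    if send(batch):
        queue.ack(batch[-1][0])
        return len(batch)
    return 0  # offline or rejected: keep the events for the next attempt
```

With this shape, a failed or interrupted sync leaves the events in place, so the server may see a batch twice and would need to deduplicate by event id.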

The idea is that, rather than painstakingly linking everything manually (e.g. with Obsidian), locality of reference seems more useful as a baseline. In this data set, links by time are valued the most. In the future I'd also like to add a screenshot/video feature, using hashes and perceptual hashes or an RDP-like approach to store as little data as possible.
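
To illustrate what "links by time" could mean in practice, here is a rough sketch that finds everything recorded within a window around a given moment. The event tuple layout and the function name `neighbors_in_time` are assumptions made for the example, not the project's actual data model:

```python
from bisect import bisect_left, bisect_right


def neighbors_in_time(events, ts, window):
    """Return events whose timestamp lies within `window` of `ts`.

    `events` is assumed to be a list of (timestamp, source, description)
    tuples, already sorted by timestamp.
    """
    stamps = [e[0] for e in events]
    lo = bisect_left(stamps, ts - window)
    hi = bisect_right(stamps, ts + window)
    return events[lo:hi]


events = [
    (10, "shell", "git commit"),
    (12, "clipboard", "copied a URL"),
    (50, "browser", "opened docs page"),
]
# Everything within 5 time units of t=11: the shell and clipboard events.
related = neighbors_in_time(events, 11, 5)
```

Because the lookup is two binary searches over sorted timestamps, no explicit links need to be stored; they are derived at query time.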

For now I'm mostly in the architecture phase, but I do have an early working version. I'm really looking for suggestions architecture-wise too. So far I have come up with my own binary format to save events on the clients, but I'm unsure if it's the right way to go. There are many challenges to think about, such as changing hardware configuration (e.g. a display being plugged in), protecting against statistical analysis (e.g. of keystroke bursts), deleting data across sources if required, making sure the system can run smoothly for a decade, etc.
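
On the statistical analysis point, one possible reading is that fine-grained key timing should not survive in the stored data. A small sketch under that assumption: round timestamps down to a coarse bucket before they are written, so the gaps between keys inside a burst are lost. The function name `coarsen` and the one-second bucket are purely illustrative:

```python
def coarsen(ts_ns, bucket_ns=1_000_000_000):
    """Round a nanosecond timestamp down to the start of its bucket."""
    return ts_ns - (ts_ns % bucket_ns)


# Two keystrokes 120 ms apart end up with the same stored timestamp.
first = coarsen(5_100_000_000)
second = coarsen(5_220_000_000)
```

This trades timing precision for privacy, and it does nothing about the number of events per bucket, which still reveals that a burst happened.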

item007|29 days ago

This is a super interesting (and refreshingly candid) direction. You’re basically building a local-first “life event ledger” with delayed sync.

Actually, I'm not an expert in this area, but I feel the challenge may not lie in data collection itself, but rather in ensuring the data remains secure, usable, and easy to maintain over many years.

A custom binary format can work, but it could be a long-term maintenance commitment (schema evolution, tooling, corruption recovery).
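
To make that concrete, here is a rough sketch of the kind of framing a custom format tends to need: a marker, a schema version, a length, and a checksum per record. The layout, the `EVT1` marker, and the function names are made up for illustration and are not the format from the post:

```python
import struct
import zlib

MAGIC = b"EVT1"                  # hypothetical record marker
HEADER = struct.Struct("<4sHI")  # magic, schema version, payload length


def encode_record(version, payload):
    """Frame a payload as header + payload + CRC32 over both."""
    header = HEADER.pack(MAGIC, version, len(payload))
    crc = zlib.crc32(header + payload)
    return header + payload + struct.pack("<I", crc)


def decode_stream(buf):
    """Yield (version, payload) for each intact record in `buf`.

    On a checksum mismatch or truncated record, scan forward to the next
    marker instead of giving up on the rest of the file.
    """
    pos = 0
    while True:
        pos = buf.find(MAGIC, pos)
        if pos < 0 or pos + HEADER.size + 4 > len(buf):
            return
        _, version, length = HEADER.unpack_from(buf, pos)
        end = pos + HEADER.size + length
        if end + 4 <= len(buf):
            (crc,) = struct.unpack_from("<I", buf, end)
            if zlib.crc32(buf[pos:end]) == crc:
                yield version, buf[pos + HEADER.size:end]
                pos = end + 4
                continue
        pos += 1  # damaged record: resume the search one byte further on
```

The version field lets a reader decide per record how to parse the payload, and the checksum limits corruption damage to single records. All of this is what existing formats and embedded databases provide out of the box, which is the maintenance cost in question.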

richardfey|28 days ago

Please let me know what you end up doing with this, I am curious!