We use the Fast Fourier Transform (#FFT) #algorithm to do #LossyCompression for things like images (e.g. JPEG, which technically uses the FFT's close cousin, the discrete cosine transform). Most of the information is dropped and only the most significant frequency coefficients are retained, which, when the transform is inverted, give a noisy but recognisable version of the original image.
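To make that concrete, here's a rough numpy sketch of the keep-only-the-biggest-coefficients trick on a 2D array. (JPEG proper does block-wise DCT plus quantisation, so this is just the spirit of it, not the real codec.)

```python
import numpy as np

def fft_compress(image: np.ndarray, keep: float = 0.05) -> np.ndarray:
    """Keep only the largest `keep` fraction of FFT coefficients,
    zero the rest, and reconstruct a lossy approximation."""
    coeffs = np.fft.fft2(image)                              # to the frequency domain
    mags = np.abs(coeffs).ravel()
    threshold = np.sort(mags)[int((1 - keep) * mags.size)]   # magnitude cutoff
    coeffs[np.abs(coeffs) < threshold] = 0                   # drop the small stuff
    return np.real(np.fft.ifft2(coeffs))                     # back to pixel space

# A random "image" just to show it runs; a real photo would come back
# with the characteristic blur and ringing of heavy frequency truncation.
noisy_but_recognisable = fft_compress(np.random.rand(256, 256), keep=0.05)
```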
If this works on a simple matrix of values (e.g. a bitmap), could it not also be done on a matrix of multi-dimensional word embeddings like #word2vec to perform lossy compression on text? Is this a thing?
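Roughly what I mean: treat a sentence as a (words × dimensions) matrix of word2vec vectors and FFT along the word axis, keeping only the low-frequency coefficient rows. This assumes gensim and a pretrained word2vec file (the GoogleNews path below is just a placeholder), so treat it as a sketch rather than anything tested:

```python
import numpy as np
from gensim.models import KeyedVectors

# Placeholder path: any vectors in word2vec format would do.
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def text_to_coeffs(words: list[str], keep: int = 16) -> np.ndarray:
    """Stack the word vectors into a (num_words x dims) matrix, FFT along the
    word axis, and keep only the first `keep` (lowest-frequency) coefficient rows."""
    matrix = np.stack([kv[w] for w in words if w in kv])   # words x 300
    coeffs = np.fft.rfft(matrix, axis=0)                   # frequency along the text
    return coeffs[:keep]                                   # lossy: drop high frequencies
```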
I'm so curious as to what the result would be. Like, could you use it as a tool to distill the meaning in text, either by reducing the amount of text or by boiling a large corpus down to just the gist, without having to store the whole thing? Obviously this would be highly problematic and probably a bad idea, but maybe humorous at least?
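Reading the "gist" back out would then just be the inverse: zero-pad the missing frequencies, inverse-FFT back to approximate word vectors, and snap each one to its nearest word in the vocabulary. Continuing the sketch above (same caveats, and the example sentence is just whatever came to mind):

```python
def coeffs_to_gist(coeffs: np.ndarray, num_words: int) -> list[str]:
    """Inverse-FFT the truncated coefficients back to approximate word
    vectors, then snap each one to its nearest word in the vocabulary."""
    approx = np.fft.irfft(coeffs, n=num_words, axis=0)     # zero-pads dropped frequencies
    return [kv.similar_by_vector(vec, topn=1)[0][0] for vec in approx]

words = [w for w in "the quick brown fox jumps over the lazy dog".split() if w in kv]
gist = coeffs_to_gist(text_to_coeffs(words, keep=4), num_words=len(words))
print(gist)   # a smoothed, probably nonsensical echo of the original sentence
```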
Another thought I had was to use it to fit a larger corpus into a limited number of parameters when priming a generative network.
If this idea has any merit, which I'm sure it doesn't, it's hard to imagine that somebody who actually knows about this stuff (unlike me) hasn't already tried it, but I haven't yet found any references. Then again, I'm not really sure what to search for.
I've found some rather humorous examples of lossy text compression attempts, for instance: https://hackaday.io/project/5689/gallery#0904354f9f40a934977415877b354407
Since writing this I’ve found loads more academic papers in this area, but haven’t had time to dive in. @aparrish has some amazing things here which I’m only just scratching the surface of. Incidentally, she has done exactly what I had in mind with FFT and word2vec, and presented it in this talk from 2016; the result is perfect, no notes. https://youtube.com/watch?v=meovx9OqWJc&si=EnSIkaIECMiOmarE