For context: I created a video search engine last year, then shut it down and put the data online. You can read about it here: https://www.bendangelo.me/2024/07/16/failed-attempt-at-creating-a-video-search-engine/

I put that project on hold because of scaling issues. Anyway, I’m back with another idea. I’ve been frustrated with how AI slop is ruining the internet, and recently it’s been hitting YouTube pretty hard with AI videos. I’m brainstorming a tool for people to self-host:

Self-hosted crawler: Pick which sites/videos to index (blogs, forums, YT channels, etc.).
AI chat interface: Ask questions like, “Show me Rust tutorials from 2023” or “Summarize recent posts about homelab backups.”
Optional sharing: Pool indexes with trusted friends/communities.

Why?
No Google/YouTube spam: only content you choose.
Works offline (archive forums, videos, docs).
Local AI (Mistral) or cloud (paid) for smarter searches.

Would this be useful to you? What sites would you crawl? Any killer features I’m missing?

Prototype in progress—just testing interest!

  • @Xanza@lemm.ee
    English
    15 · 2 months ago

    Why the hell does everything have to be AI for you people to be happy? I just plain don’t understand it. We know that AI hurts your critical thinking and reasoning skills, and we continue to just pack AI into everything… Doesn’t make sense. Sooner or later you’re gonna need to ask ChatGPT whether or not you need to wipe your own ass.

    • @wise_pancake@lemmy.ca
      English
      9 · 2 months ago

      There are various levels of AI here.

      Storing embeddings/vectors in a search index can make your searches smarter and more relevant. The embeddings squeeze related concepts closer together than pure keyword approaches, which if done well increases retrieval quality.
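A minimal sketch of that difference, under loud assumptions: the 3-dimensional vectors below are hand-written for illustration and do not come from any real embedding model, and the document titles are hypothetical. It only shows how a nearest-vector lookup can surface documents that an exact keyword match misses.

```python
import math

# Hypothetical, hand-written "embeddings" for illustration only.
# A real index would store vectors produced by an embedding model.
DOCS = {
    "Nightly snapshots with restic": [0.9, 0.1, 0.0],
    "Rust borrow checker tutorial": [0.0, 0.1, 0.9],
    "Backup rotation for a homelab NAS": [0.8, 0.3, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_search(query, docs):
    """Exact-word match: a title hits only if it contains a query word verbatim."""
    terms = query.lower().split()
    return [t for t in docs if any(term in t.lower().split() for term in terms)]


def vector_search(query_vec, docs, top_k=2):
    """Rank titles by cosine similarity to the query vector and keep the top_k."""
    ranked = sorted(docs, key=lambda t: cosine(query_vec, docs[t]), reverse=True)
    return ranked[:top_k]


QUERY = "backups"
QUERY_VEC = [0.85, 0.2, 0.05]  # assumed embedding of the query, also hand-written

keyword_hits = keyword_search(QUERY, DOCS)    # "backups" matches no title word exactly
vector_hits = vector_search(QUERY_VEC, DOCS)  # the two backup-related titles rank highest
```

Whether this helps in practice depends entirely on the quality of the real embedding model, which this toy does not capture.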

      RAG tools and AI searches are just a layer on top of your index. When done well these can be really useful in annotating your results and speeding up finding things.

      That’s useful when you’re searching for, say, an error message and the AI is able to iterate on keywords, skim a GitHub issue about it, and skip to the resolution.

      Similarly, it’s good when you’re researching something but don’t have the exact words: AI search can iterate and capture your intent, then run several queries based on that.
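The comment does not say how results from several queries get combined. One common option is reciprocal rank fusion (RRF), sketched here as an assumption rather than as what any particular tool does. The result IDs and the three result lists are hypothetical; `k=60` is the constant usually quoted for RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over all lists."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken alphabetically so the output is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for three rephrasings of the same question.
RESULTS_PER_QUERY = [
    ["issue-42", "blog-7", "forum-3"],
    ["forum-3", "issue-42"],
    ["issue-42", "wiki-1"],
]

FUSED = reciprocal_rank_fusion(RESULTS_PER_QUERY)
```

Documents that show up near the top of several lists float upward, while one-off hits sink, without needing comparable scores across queries.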

      I don’t find the hallucination problem significant in practice with a lot of AI search tools, but I have found AI is vulnerable to certain types of SEO spam that a human would never fall for.

      As an example, most companies have a “comparison to” or “alternatives to” blogpost. The AI does not critically look at the fact that a service is hosting a blogpost shilling its own product. So asking a search AI for options actually gives poor-quality answers, because it will return the shilled results that appear in search first.

      AI search also adds an additional silent layer of filtering, which you need to be conscious of.

      • @rebelflesh@lemm.ee
        English
        2 · 2 months ago

        But it’s a search engine; we actually figured those out a few years ago. What advantage is AI going to bring? Do we also need AI wheels now?

        This is the “smart” thing all over again: I don’t need a smart toilet or a smart toothbrush.

    • irmadlad
      English
      2 · 2 months ago

      For the average consumer of AI, it’s a novelty at this point, even tho we have been using pieces parts of AI for a long while now. But it’s hitting its stride in stuff like face swaps, neat TikTok videos, making weird pictures. I liken it to when ‘the cloud’ came to town. Hell, we’d been uploading to servers and running apps on servers for a long while before ‘the cloud’ happened. Everyone and their brother trampled each other to move their entire operations to the cloud. Then, as the dust all settled, we started realizing that not everything that could be in the cloud should be in the cloud, and so things got back to normal. But just the words ‘the cloud’ made CEOs jizz their pants at one time.

      Same same with AI. It’s a selling point. There was a thread here, I believe, talking about an AI rice cooker. The ‘AI’ part sells it, even tho we’ve been making excellent rice for millennia. I use AI. I find it a faster way to cut through all the searches and give bulleted points to deviate from. I realize that it’s not best practice to rely on AI’s word, but I use it as a springboard into further investigation.

  • @rtxn@lemmy.world
    English
    8 · 2 months ago (edited)

    No. I’m so bloody fed up with AI “search” solutions that return everything on the fucking planet except what I want. Text search has been a solved problem for a decade. All I want out of a search engine is to be deterministic, stable, and reliable, and to look in titles, descriptions, and keywords. Vibe processing is completely unnecessary and will only create issues.

    If you really want to iNnoVAte, then consider creating an index with transcripts and summaries that users can search by keywords.
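A minimal sketch of that kind of index, assuming transcripts are already available as plain text. The video records, field names, and IDs are hypothetical. It is a plain inverted index with AND semantics: the same query against the same data always returns the same sorted result.

```python
import re
from collections import defaultdict

# Hypothetical records; in practice a crawler would supply titles and transcripts.
VIDEOS = [
    {"id": "v1", "title": "Rust tutorial 2023",
     "transcript": "Today we cover ownership and borrowing in Rust."},
    {"id": "v2", "title": "Homelab backups",
     "transcript": "I back up my NAS with restic every night."},
    {"id": "v3", "title": "Rust web servers",
     "transcript": "Building an HTTP server with async Rust."},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(videos):
    """Map each word to the set of video ids whose title or transcript contains it."""
    index = defaultdict(set)
    for video in videos:
        for field in ("title", "transcript"):
            for token in tokenize(video[field]):
                index[token].add(video["id"])
    return index


def search(index, query):
    """Return sorted ids of videos that contain every query word."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)


INDEX = build_index(VIDEOS)
```

For example, `search(INDEX, "rust server")` only returns videos where both words appear somewhere in the title or transcript. There is no stemming here, so "server" and "servers" are different words.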

  • @DarkSpectrum@lemmy.world
    English
    7 · 2 months ago

    AI uses so many more resources than standard search engines, and it comes at a time when the whole planet needs to slow down climate change.

  • @marauding_gibberish142@lemmy.dbzer0.com
    English
    4 · 2 months ago

    I think SearXNG already has AI integration. Not sure how it works though. I don’t think that I would personally use AI for things other than summarising what I search, but it is a useful feature to have.

      • @marauding_gibberish142@lemmy.dbzer0.com
        English
        1 · 2 months ago

        Sorry, I was wrong. I think I probably saw it in a blog post where they mentioned creating an AI search engine using SearXNG and Ollama. I don’t see any mention of native Ollama integration in the SearXNG docs.

  • @solrize@lemmy.world
    English
    4 · 2 months ago

    AI has become an abbreviation for “bad” and I wouldn’t want that, but yes, I’ve been interested for a while in building language models into search engines, to give the queries more reach into the document semantics. Unfortunately, naive approaches like looking for matching vector embeddings instead of (or alongside) search terms seem near useless, and just clutter up the results.

    I’d be interested in knowing what approaches you’re using. FOSS, I hope.

  • @Ptsf@lemmy.world
    English
    2 · 2 months ago

    Indexing websites adds significant traffic to those sites. It’s not a good idea for the health of the internet for everyone to be indexing. Maybe you should search for a precompiled index you can train the LLM on and distribute it daily. Or do the crawling yourself and distribute that index.
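If people do crawl for themselves, the usual way to limit the load on a site is to honour its robots.txt and any crawl delay it declares. A minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the `example.org` URLs, and the crawler name are hypothetical, and a real crawler would download the file from the site instead of hard-coding it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, hard-coded so the example needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

USER_AGENT = "selfhosted-indexer"  # hypothetical crawler name


def make_parser(robots_txt):
    """Build a parser from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, url):
    """True if the rules permit USER_AGENT to fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)


PARSER = make_parser(ROBOTS_TXT)

# Seconds to wait between requests; fall back to 1 if the site declares nothing.
DELAY = PARSER.crawl_delay(USER_AGENT) or 1
```

A crawler loop would call `allowed(PARSER, url)` before each request and sleep for `DELAY` seconds between requests to the same host.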

  • Avid Amoeba
    English
    2 · 2 months ago

    I think the really useful idea here is solving the scaling issue by limiting the source sites to a known good set. 95% of the time I am not looking for results from unknown sites. In fact I actively work to get information from the sites I trust.
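A minimal sketch of restricting a crawl to a known good set, assuming the set is a list of trusted hostnames. The hostnames and URLs below are hypothetical examples, and `is_trusted` and `filter_frontier` are illustrative names, not part of any existing tool.

```python
from urllib.parse import urlparse

# Hypothetical allowlist of hosts the user has chosen to trust.
TRUSTED_HOSTS = {"doc.rust-lang.org", "selfhosted.example"}


def is_trusted(url, trusted=TRUSTED_HOSTS):
    """True if the URL's host is a trusted host or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == t or host.endswith("." + t) for t in trusted)


def filter_frontier(urls):
    """Keep only URLs on trusted hosts, preserving their order."""
    return [u for u in urls if is_trusted(u)]


CANDIDATES = [
    "https://doc.rust-lang.org/book/",
    "https://spam.example/rust",
    "https://forum.selfhosted.example/t/1",
    "https://notdoc.rust-lang.org.evil.test/",
]
KEPT = filter_frontier(CANDIDATES)
```

Matching on the parsed hostname, not on a substring of the URL, is what keeps lookalike domains such as the last candidate out of the index.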

  • @Jakeroxs@sh.itjust.works
    English
    2 · 2 months ago (edited)

    Seems nifty. Bake in stuff like selecting your AI provider (support local Llama, a local OpenAI API, and a third party if you have to use one, I guess lol), and make sure it’s dockerized (or is relatively easy to dockerize; bonus points for including a compose).

    Oh, being able to hook into a self-hosted engine like SearXNG would be nice too. You can do that with the Oobabooga web search plug-in currently, as an example.

  • EarMaster
    English
    1 · 2 months ago

    While almost everyone here seems to hate AI (maybe for the wrong reason, but who am I to judge) I like to have AI as it is able to provide answers a simple search engine cannot.

    What I don’t see is hosting something like this myself. Managing the sources and indexing them would take too much energy: mine, my server’s, and that of the web servers being indexed (maybe I am wrong).

    There are already good solutions (OpenWebUI with Ollama) that can be tweaked to almost do what you’re describing and the AI models get better every month, so I don’t think a custom AI search engine could keep up with it.

  • @Fedditor385@lemmy.world
    English
    1 · 2 months ago (edited)

    People will pay for solutions to their problems, and most people and companies don’t seem to want to hear that the problem we have is AI being in everything.

    The next BIG THING will have a single marketing label - no AI inside.

    Actually, I need to update my GitHub repos with “No AI inside” labels, stickers, etc. Might bring in more visibility.

  • @SoftestSapphic@lemmy.world
    English
    0 · 2 months ago (edited)

    Web scrapers are all that’s needed.

    AI is worthless except for the few uses it has combing through medical data.

    AI should never be used to try to influence people.

  • @Harlehatschi@lemmy.ml
    English
    -1 · 2 months ago

    Why would I need AI for that? We should really stop trying to slap AI on everything. Also no, I’m not that big of a fan of wasting energy on web crawlers.