Search at Slack | WingNews

[+] subpixel|9 years ago|reply

Slack is great, but its adoption by popular open source communities is problematic.

Why? Because open source communities are on the free plan, which limits search once you have 10k messages. I've had experiences where I wanted to revisit a question I had asked in a Slack channel the previous week and was unable to find it.

As a result, everyone burns out faster b/c the same questions get asked and answered over and over.

Couple this with the fact that channels are not indexed by Google and you get a black box where valuable Q&A content and discussion goes to die.

[+] jakebasile|9 years ago|reply

I agree. I am dismayed when I see open source projects using Slack in lieu of IRC or a mailing list. It means I'd be forced to use their awful client (which is slow, buggy, and far too resource intensive for a chat application) or use their awful IRC integration. This is all in addition to the issue you raise of Slack being a black hole beholden to a profit motivated entity.

Just use IRC. It's practically impossible to avoid Slack at any startup now, but I'd love to be able to avoid it in FOSS.

[+] austenallred|9 years ago|reply

That's an easy problem to fix http://slackarchive.io/

[+] accountface|9 years ago|reply

Why are people using a chat app like a wiki to begin with?

[+] timClicks|9 years ago|reply

It's also problematic in that it chisels away at what it means for a project to be open source. The focus seems to be purely on the software licence, and increasingly less about the wider project tooling.

GitHub and Slack provide a huge amount of utility. But they also feel hollow to me. It feels harder and harder to opt-out of using closed tools.

[+] zanny|9 years ago|reply

What is Matrix / Riot missing today to get these open source communities to start using an open source federated chat room protocol?

[+] jonbaer|9 years ago|reply

The 2 links for UI/UX comparison ... https://support.discordapp.com/hc/en-us/articles/11500046858... https://get.slack.help/hc/en-us/articles/202528808-Searching... ... I think on Discord it is based on the context of what the secondary thing is you are doing outside of chat (gaming - it will show you what game the others are playing - or any app based on file process - which is pretty slick) and on Slack it will be primarily thinking you are at work so it's all about quicker access to files/presentations/etc.

[+] wcummings|9 years ago|reply

IRC isn't that hard. Just register a channel on freenode (which exists for the purpose of facilitating open source!); there's already a web interface [1]. You can stick a nice link on your project's webpage. It's even less work for users than Slack: you don't have to register, just punch in a nickname.

Someone in your project can manage to set up a logbot that dumps logs onto a webserver, which will be indexed by Google. I suspect there are services that will do it for you, so you might not even have to set up the bot yourself. If there isn't one I'd have half a mind to build one, if it gets more projects using IRC.

[1] https://webchat.freenode.net/
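
The logging half of the logbot idea above can be sketched roughly as follows. This is only an illustration: it assumes the bot already receives raw IRC lines of the form `:nick!user@host PRIVMSG #channel :text`, and the function names `parse_privmsg` and `format_log_line` are made up for this example. Connecting to the network and publishing the file on a webserver are left out.

```python
def parse_privmsg(raw_line):
    """Return (nick, target, text) for a PRIVMSG line, or None for anything else."""
    line = raw_line.rstrip("\r\n")
    if not line.startswith(":"):
        return None  # no prefix, e.g. "PING :server"
    prefix, _, rest = line[1:].partition(" ")
    command, _, params = rest.partition(" ")
    if command != "PRIVMSG":
        return None
    # The trailing parameter (the message text) starts after " :".
    target, sep, text = params.partition(" :")
    if not sep:
        return None
    nick = prefix.split("!", 1)[0]
    return nick, target, text


def format_log_line(timestamp, nick, text):
    """Format one message as a plain-text log line (hypothetical layout)."""
    return f"[{timestamp}] <{nick}> {text}"
```

Appending each formatted line to a per-day text file in a directory served over HTTP would be one way to make the history crawlable.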

[+] fletchowns|9 years ago|reply

I have experienced this as well with the Chef Community Slack channel. It is a wonderful resource and it's super convenient to have easy access from my phone, etc., but there is so much useful information in there that won't be accessible by others in the future.

[+] ankit-singh|9 years ago|reply

Looks like Solr is being used to rank messages using a few features and then re-ranking the top n at the application layer using another set of features. This would constrain the search quality as 1) you have no control over how these two sets of features interact and 2) messages ranked low based on the first set of features could be highly relevant according to the second set. Another possible approach could have been using a custom scorer to influence scoring at the Lucene level, thereby combining all the features at a single point. Was this approach evaluated? If so, any insights as to what could be a limitation?

[+] isabellat|9 years ago|reply

Great question. We rank in two stages for a number of different reasons. First, it would be too expensive from a performance perspective to rank all of the messages in your corpus. Second, some of the features that we use to rank are much more easily accessed at the application layer. It would require more of an engineering effort to make these signals accessible in Solr. The first pass, which is done in Solr, is a high recall, low precision pass. The second pass, through our custom ranker, is a high precision pass. It is possible that the first pass drops some messages that would have ended up being important, but it's a tradeoff between performance and accuracy. Hope this helps answer your question.
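
A minimal sketch of the two-stage idea described above, to make the tradeoff concrete. It is not Slack's implementation: the `Message` fields, the term-overlap first pass (standing in for the Solr query), and the hand-picked second-pass weights are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Message:
    text: str
    age_days: float         # hypothetical feature
    author_affinity: float  # hypothetical application-layer feature in [0, 1]


def term_overlap(query, msg):
    """Number of query terms that appear as whole words in the message."""
    words = msg.text.lower().split()
    return sum(1 for term in query.lower().split() if term in words)


def first_pass(query, corpus, n):
    """Cheap, high-recall pass: keep the top n messages by term overlap."""
    scored = [(term_overlap(query, m), m) for m in corpus]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:n]]


def second_pass(query, candidates):
    """Costlier, high-precision pass over the surviving candidates only."""
    n_terms = len(query.split())

    def score(m):
        overlap = term_overlap(query, m) / n_terms
        recency = 1.0 / (1.0 + m.age_days)
        # Made-up weights; a real system would learn these.
        return 0.5 * overlap + 0.2 * recency + 0.3 * m.author_affinity

    return sorted(candidates, key=score, reverse=True)


def search(query, corpus, n=100):
    return second_pass(query, first_pass(query, corpus, n))
```

Anything cut by `first_pass` never reaches `second_pass`, no matter how well it would have scored there. That is the loss mentioned above, and a larger `n` trades speed for fewer such misses.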

[+] Eridrus|9 years ago|reply

Besides the technical answer already given, this is a pretty standard architecture for search ranking and other problems where your fine-grained decisions are too expensive to run on everything.

[+] dgreensp|9 years ago|reply

The subject of how bad Slack's search is comes up all the time in talking to friends and co-workers, and I wonder if the described ranking changes are enough.

Usually when I'm searching, I'm looking for a particular message, possibly even one I read earlier that day, and I may know a few things about it, like who sent it and that it had an important link, but I still can't necessarily find it! The results are also presented in a giant cartoony way that makes me page through many pages. Tokenizing my search into "keywords" means that even if I know a substring of what I'm looking for, it doesn't come up as relevant, or the tokenizer tokenized the text differently. This is also why GitHub search can't find a lot of things.

What I would want in a search experience is the equivalent of Control-F over the list of messages I've actually seen.
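
To illustrate the tokenization complaint above, here is a small sketch comparing whole-token matching with plain substring matching. The tokenizer is a deliberately simple assumption (lowercase runs of letters and digits) and is not how Slack or GitHub actually tokenize.

```python
import re


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (simplified assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_match(query, message):
    """True only if every query token is a whole token of the message."""
    message_tokens = set(tokenize(message))
    return all(tok in message_tokens for tok in tokenize(query))


def substring_match(query, message):
    """Control-F style: true if the query appears anywhere in the message."""
    return query.lower() in message.lower()
```

With the message `see https://example.com/docs/deploy_guide for details`, the partial query `deploy_gui` matches as a substring but fails token matching, because `gui` is not one of the message's tokens.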

[+] jliszka|9 years ago|reply

Stay tuned! We're working on improving phrase matching as we speak.

As for the Control-F thing... stay tuned on that too :)

[+] leothekim|9 years ago|reply

This sounds like a step towards making Slack more of a knowledge repository and possibly a wiki replacement. One of my (many) qualms with knowledge repository tools like Google Sites is that searching them is basically useless. Another is that knowledge in these repos becomes stale really quickly. If you can put meaningful information in a Slack post, you can take advantage of the recency-focused nature of chat and smarter searching algorithms like the one described here, and essentially make an internal Google for your organization. Kudos to the Slack team, very interested to see how this evolves.

[+] trafficlight|9 years ago|reply

I've been searching for a solution to this as well. Our Slack holds an amazing amount of information, but it's really difficult to curate that information efficiently.

It'd be cool to highlight a piece of information and insert it into a wiki-style site.

[+] joe_fro|9 years ago|reply

This is a very informative article. If you're interested in getting started with search relevancy I would also suggest the book: https://www.manning.com/books/relevant-search

Which was very helpful to me.

[+] garysieling|9 years ago|reply

I second this recommendation, I used Relevant Search to build https://www.findlectures.com

[+] softwaredoug|9 years ago|reply

Thanks! (This is Doug Turnbull :-p).

[+] donretag|9 years ago|reply

I enjoy reading Doug Turnbull's blog and most of his writing, but I found this book tedious. I purchased it despite reading that terrible first chapter which is available as a free sample. Perhaps it could be that I am already too well-versed in the subject.

Any book where you learn at least one thing new is always worth it, so I do not regret having this book in my library.

[+] isabellat|9 years ago|reply

Thank you for the book suggestion. Happy you found the article informative!

[+] amelius|9 years ago|reply

Open-source desperately needs more search-tool projects.

Lucene/Solr/Elasticsearch are nice, but they need competition, especially outside the Java world.

[+] dvirsky|9 years ago|reply

Shameless plug: I'm working on RediSearch, an open-source, in-memory Redis module written in C that does search. Not nearly as full-featured as Lucene and friends, of course, but it's a very young project. http://redisearch.io

[+] scaryclam|9 years ago|reply

I'm not really sure what you're trying to say in your comment. Sure, there's a lot of Java going on there, but does that really matter? They're tools and you can interact with them from the language of your choice.

I get the competition part, but none of the above are exactly stagnant, so I'm wondering what you'd like to see more competition achieve.

Not trying to be difficult, just curious in case I missed something from your comment :)

[+] nswanberg|9 years ago|reply

Take a look at http://bitfunnel.org/, an open-source version of some of the algorithms from Bing. It's not ready for production use but you may enjoy following along in the implementation of a large search engine. The design notes and blog are also worth reading.

[+] slantedview|9 years ago|reply

IMO this is like saying Hadoop or Spark or Kafka needs more competition outside the Java (JVM) world. Elasticsearch, at least, has a dead simple web API which makes it accessible from any platform, along with a slew of clients and integrations with other tools/platforms.

[+] lightbulbjim|9 years ago|reply

I've used Sphinx (http://sphinxsearch.com) in a previous job. If it fits your needs then it's pretty nice.

[+] whateveracct|9 years ago|reply

Why do you care that those black boxes are implemented in Java?

[+] dilap|9 years ago|reply

I thought this was going to be an announcement about how they finally fixed search...

A suggestion: When I search, what I want 99% of the time to happen is that the current window I'm looking at quickly gets filtered to my search query. Ranking doesn't matter, just show exact matches ranked by time.
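
The behaviour suggested above (no relevance ranking, just exact matches ordered by time) fits in a few lines. This is only a sketch; the `Message` type and `filter_window` name are invented for the example.

```python
from typing import NamedTuple


class Message(NamedTuple):
    timestamp: float  # seconds since epoch, assumed
    text: str


def filter_window(messages, query):
    """Keep messages containing the query verbatim (case-insensitive), newest first."""
    needle = query.casefold()
    hits = [m for m in messages if needle in m.text.casefold()]
    return sorted(hits, key=lambda m: m.timestamp, reverse=True)
```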

1% of the time, I want something else.

[+] jliszka|9 years ago|reply

Stay tuned :)

[+] isabellat|9 years ago|reply

Hi, I'm one of the authors on the post, happy to answer any questions.

[+] throwthisawayt|9 years ago|reply

This sounds like a really interesting problem to work on. Are you hiring anyone for this team (esp those new to search)?

[+] iloveluce|9 years ago|reply

Awesome article! The article mentions the signals that the model found were most significant for a message. Curious to know if they're listed by order of significance in the article? And if not was wondering whether one or two signals were predominantly more significant than the rest.

[+] g12mcgov|9 years ago|reply

Is there any possibility in the future users will be able to search ALL messages from a channel/private message, ever? It seems like Slack search cuts off after a certain point, and doesn't index into archived messages.

[+] danpalmer|9 years ago|reply

Today I learned Slack has a "relevant" option for search terms. Maybe I should try it again - I had stopped using search entirely because of the results being fairly irrelevant.

[+] samcrawford|9 years ago|reply

My experience is similar. Slack is great, but the whole search experience remains terrible for me. Results are largely irrelevant, jumping back and forth into conversations takes a long time, the UI sidebar has very little space, and if I'm looking for a file it's hard to remember if it was sent as a URL or as a file in Slack. Perhaps I'm spoilt by Gmail search, but this is the one area of Slack that I think is sorely lacking.

[+] rajhans|9 years ago|reply

Great work! A few questions for the author(s): In the article, you have listed 9 feature extractors/templates. In the final model, what's the total number (or rough magnitude) of features? How much data (or ballpark estimate) did you train this on? Did you try to deal with potential difference in distribution across your data sampling sources?

[+] kusmi|9 years ago|reply

I've never used Slack because, last I remember, they don't allow local installation and therefore give no access to any documents that were dropped in. Is this still the case? I have been using Mattermost instead, and wrote a bot which processes all documents uploaded into Mattermost, extracts metadata, creates tags and summaries for each, archives the documents in an ECM (categorizing them by their tags), and opens them up for full-text Solr search. My understanding is that a custom solution like this wouldn't be possible with Slack, or would otherwise require more hacky solutions?

[+] paulcole|9 years ago|reply

> they don't allow local installation

This is like saying Ford sells cars they don't allow you to fly.

If you want something that flies, buy a plane.

[+] bearcobra|9 years ago|reply

I'd love it if they made it easier to search just the channel or conversation you have in focus. Having the filter autocomplete based on what channel you're on seems like a good middle ground.

[+] isabellat|9 years ago|reply

If you use the shortcut 'Command-F' it will search the current channel or conversation by auto-populating the search bar with 'in:channel_name'. Hope that's what you are looking for! Here's a list of all shortcuts: https://get.slack.help/hc/en-us/articles/201374536-Slack-key....
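
As a rough illustration of how an `in:channel_name` prefix can scope a query, the sketch below separates that modifier from the free-text terms. The function `split_query` is hypothetical and says nothing about how Slack parses queries internally.

```python
def split_query(query):
    """Return (channel, terms): the last in:<channel> modifier, if any, plus the remaining words."""
    channel = None
    terms = []
    for part in query.split():
        if part.startswith("in:") and len(part) > len("in:"):
            channel = part[len("in:"):]
        else:
            terms.append(part)
    return channel, terms
```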

[+] softwaredoug|9 years ago|reply

Neat. We've been building a Learning to Rank plugin for Elasticsearch. Feedback and contributions very welcome

https://github.com/o19s/elasticsearch-learning-to-rank

[+] lacksconfidence|9 years ago|reply

I've also been building a learning to rank plugin for elasticsearch, we might want to sync up.

I hadn't set it up to sync to GitHub as it was just internal development, but I've started the sync and it will show up at https://github.com/wikimedia/search-ltr soon.

It's got a bit more of the integration with Elasticsearch put together, including storing models in cluster state and a REST interface for managing them. It's a bit more of a direct port of the Solr plugin rather than a rewrite from the ground up, so there are also some oddities that don't yet make sense. Refactors will certainly be done. It's also tied a little less directly to RankLib, such that I can convert and load in MART models trained by LightGBM or XGBoost, which have done pretty well in my offline tests and are able to utilize resources on my training machine much more efficiently than RankLib's LambdaMART (although in terms of results, the RankLib implementation is pretty good).

[+] jtoberon|9 years ago|reply

Question for the author: how do you actually deploy your model? Do you have a dependency on Spark in your production system?

[+] isabellat|9 years ago|reply

Since it's just a dot product between the learned weights and the feature vector, we do this in the application layer as lacksconfidence surmised.
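
A minimal sketch of what scoring with a linear model looks like once the weights are exported, with no training framework needed at query time. The z-score normalization step and all names here are assumptions for illustration; the thread does not say how normalization was actually handled.

```python
def normalize(features, means, stds):
    """Standardize raw features with statistics saved at training time (assumed step)."""
    return [(x - m) / s if s else 0.0 for x, m, s in zip(features, means, stds)]


def score(weights, features, bias=0.0):
    """Linear model score: dot product of weights and features, plus a bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def rerank(candidates, weights, means, stds):
    """candidates: list of (message_id, raw_feature_vector); best score first."""
    scored = [
        (score(weights, normalize(vec, means, stds)), message_id)
        for message_id, vec in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [message_id for _, message_id in scored]
```

The only artifacts that need to ship to production in this setup are the weight vector and the per-feature means and standard deviations.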

[+] lacksconfidence|9 years ago|reply

Being that this is an SVM, which is typically evaluated as a simple linear sum of weights, I imagine they reimplemented that in the application layer. Would be curious how they handled the normalization steps (reimplement that as well?)

[+] notforgot|9 years ago|reply

Hey, Slackers, you can't find what isn't there.

Most of the time, I search for information because I don't know anything about that part of the software. I never found this kind of information in Slack.

Parse the company docs, or our rep, and now we're talking.

91 comments