Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

>> stop words

It's a good thing neither Bing nor Google remove so called stop words, because if they did I wouldn't be able to google them for the lyrics from that great song by "The the".

If you use a system that require you to remove stop words from your lexicon I suggest you find better tech because removing them destroys linguistic context. If you're using a search engine that doesn't care about linguistic context, again, you should find better tech.

Even the most naive search engine should at least be using tf-idf or some other statistical tool to determine both a token's document frequency as well as its lexical frequency. That's very important context to have.

If you insist on using stop words most tokenizers would assume you did the filtering or they'd want you to submit a word list for it to use as filter. Be aware, you are not making tokenization much simpler, because a proper search engine stores each distinct token once and only once in its index, so you hardly saved any space. What you did was to slow down the indexing process and use lots more CPU cycles for each token which translates to energy waste in my mind.

Leave your stop words right where they are, and save the planet.



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: