
There's bound to be a way to turn a stream of bytes into a stream of Unicode code points (at least I think that's what Python is doing for strings). Though I'm explicitly not volunteering to write the code for it.


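    import codecs
    import mmap
    from collections import Counter

    def word_count(filepath):
        freq = Counter()
        # The incremental decoder buffers multi-byte sequences split across chunks.
        decode = codecs.getincrementaldecoder('utf-8')().decode
        tail = ''  # partial word carried over from the previous chunk
        # Note: mmap raises ValueError for an empty file.
        with open(filepath, 'rb') as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for chunk in iter(lambda: mm.read(65536), b''):
                text = tail + decode(chunk)
                words = text.split()
                # If the text doesn't end in whitespace, its last word may continue in the next chunk.
                tail = words.pop() if words and not text[-1].isspace() else ''
                freq.update(words)
            freq.update((tail + decode(b'', final=True)).split())
        return freq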


Oh, that's neat, though in most cases I'd split this into two functions; there's no need to entangle opening the file with counting the words in a file-like object.
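Something like the following is what I have in mind. It's only a sketch: the names count_words and count_words_in_file are mine, and it assumes the stream is opened in binary mode and that "word" means whitespace-separated.

```python
import codecs
import io
from collections import Counter


def count_words(stream, chunk_size=65536, encoding="utf-8"):
    """Count whitespace-separated words in a binary file-like object."""
    freq = Counter()
    decoder = codecs.getincrementaldecoder(encoding)()
    tail = ""  # partial word carried over from the previous chunk
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        text = tail + decoder.decode(chunk)
        words = text.split()
        # Hold back the last word unless the text ended in whitespace.
        tail = words.pop() if words and not text[-1].isspace() else ""
        freq.update(words)
    # Flush the decoder and count whatever is left.
    freq.update((tail + decoder.decode(b"", final=True)).split())
    return freq


def count_words_in_file(path):
    """Open the file and delegate; the counting logic never sees a path."""
    with open(path, "rb") as f:
        return count_words(f)


# Works on anything with a read() method, e.g. an in-memory buffer.
sample = io.BytesIO("naïve café naïve".encode("utf-8"))
result = count_words(sample, chunk_size=3)
```

The tiny chunk_size here is only to force words and multi-byte characters to straddle chunk boundaries; the counting function is then easy to test without touching the filesystem.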

That's two neat tricks that I'm definitely adding to my bag of Python trickery.


Sure, but making one string from the file contents is surely much better than having a separate string per word in the original data.

... Ah, but I suppose the existing code hasn't avoided that anyway. (It's also creating regex match objects, but those get disposed each time through the loop.) I don't know that there's really a way around that. Given the file is barely a KB, I rather doubt that the illustrated techniques are going to move the needle.

In fact, it looks as though the entire data structure (whether a dict, Counter etc.) should be a relatively small part of the total reported memory usage. The rest seems to be internal Python stuff.
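A rough way to check this, assuming tracemalloc's view is good enough (it only sees Python-level allocations made after tracing starts, so the interpreter's own baseline is excluded). The function name measure_counter is just for illustration.

```python
import tracemalloc
from collections import Counter


def measure_counter(text):
    """Return the Counter plus (current, peak) bytes allocated while building it."""
    tracemalloc.start()
    freq = Counter(text.split())
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return freq, current, peak


freq, current, peak = measure_counter("the cat and the hat " * 50)
print(f"{len(freq)} distinct words, {current} bytes live, {peak} bytes peak")
```

Comparing those numbers against the total the profiler reports for the process gives an idea of how much is the data structure and how much is the interpreter itself.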


I dislike loading files into memory entirely; in fact, I consider avoiding that one of the few interesting problems here (the other being how to count words in a stream of bytes without converting the whole thing to a string).

If you don't care about efficiency you can just do len(set(text.split())), but that's barely worth making a function for.
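To be clear about what that one-liner gives you, here is a toy comparison (the sample text is made up): the set version only yields the number of distinct words, while a Counter also keeps how often each one occurs.

```python
from collections import Counter

text = "the cat and the hat"

# Number of distinct whitespace-separated words.
distinct = len(set(text.split()))

# Per-word frequencies; len(freq) matches the distinct count.
freq = Counter(text.split())
```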



