Hacker News
Ask HN: How should I organize and back up 23 TiB of personal files?
46 points by pushedx on March 19, 2024 | hide | past | favorite | 100 comments
A somewhat daunting project that I've been putting off for a long time is organizing and backing up 23 TiB of files spread across 40+ external and internal hard drives that I've collected throughout my life. There is a variety of filesystem types and interface types.

I take a lot of photos, so a lot of these are files that I would actually want to back up, but many of them are old operating system installs and other "useless" files that I don't need archival storage for.

The actual size of the data that I need backed up I would estimate at around 6 TiB.

A few of my requirements:

1. I don't need the files to be accessible online, in fact, I would prefer if they were not.

2. If anything is backed up to the cloud, I want pre-internet-encryption with keys that only I know and control.

3. I want something simple, that could be recovered using a pragmatic approach and open source software in case of a disaster.

4. I'd like a system where I can easily test my recovery strategy.

Open questions:

1. What local filesystem setup should I use? Number of drives? Local backup approach?

2. If you've done this before, is there a strategy that you used for the actual aggregation of the data? Are there any particularly convenient IDE-to-USB docks? Any good software you would recommend for locating duplicate files?

3. What remote backup software should I use?

[ edits ]

Answering some questions from the comments:

Cost: Given a quick look at the cost of archival cloud storage, I guess I would be willing to spend up to $60 per month on a remote copy. (Noting the estimate of 6 TiB of "actual" data)

How often: I would expect to access the remote files rarely (maybe once a month), and need a complete recovery very rarely (with a low requirement for recovery speed). Local backups I would like to occur at least weekly, with a verification or access frequently (daily?).

Risk tolerance: For the local-encryption-for-remote-storage aspect, I would like something with a high level of confidence in the cryptography and the implementation surrounding it. I would also like a high degree of confidence that I can recover my files in case of a natural disaster or similar that wipes out my local copies.

Local security: I live in a relatively secure home in a relatively low crime area. I could store a copy at a relative's house, although I may move far enough away soon that it would not be practical to deliver or access such a backup.



You should look at Borg for the remote backup software. It does automatic deduplication; has the security posture you're asking for ("untrusted server")*; and is agnostic about which backup cloud provider you use it with. Of course it's FOSS.

https://en.wikipedia.org/wiki/Borg_(backup_software)

https://news.ycombinator.com/item?id=21642364 ("BorgBackup: Deduplicating Archiver", 103 comments)

https://news.ycombinator.com/item?id=34152369 ("BorgBackup: Deduplicating archiver with compression and encryption", 177 comments)

*You definitely don't want your private filenames leaked to data brokers, like Backblaze's clients experienced.

https://news.ycombinator.com/item?id=26536019 ("Backblaze submitting names and sizes of files in B2 buckets to Facebook", 517 comments)
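Borg's "automatic deduplication" comes from splitting files into chunks and storing each unique chunk exactly once, keyed by its hash. A toy sketch of the principle — real Borg uses content-defined (rolling-hash) chunking with much larger chunks, plus encryption and compression; this fixed-size version just shows the idea:

```python
import hashlib

CHUNK_SIZE = 4  # tiny for the demo; Borg targets roughly 2 MiB chunks

def store(data: bytes, chunks: dict) -> list:
    """Split data into chunks, store unique ones, return the 'recipe'."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        chunks.setdefault(digest, chunk)  # identical chunks stored once
        recipe.append(digest)
    return recipe

def restore(recipe: list, chunks: dict) -> bytes:
    """Rebuild the original data from its recipe of chunk hashes."""
    return b"".join(chunks[d] for d in recipe)

chunks = {}
r1 = store(b"AAAABBBBCCCC", chunks)
r2 = store(b"AAAABBBBDDDD", chunks)  # shares two chunks with the first file
print(len(chunks))  # 4 unique chunks stored instead of 6
```

Two 12-byte "files" sharing 8 bytes of content cost only 4 stored chunks, which is why repeated OS installs and photo copies across 40 drives shrink so dramatically under Borg.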


Thanks for surfacing the story about Backblaze.

I had the same type of thing happen to me with the LastPass leak, which made me very wary of closed-source supposed-e2e encrypted services.

Looking at Borg, it doesn't seem to have native support for the actual transfer of the remote backup. It seems like if I used Borg it would be for a local backup, and then I would need an additional layer to sync the backup to remote storage.

It looks like rclone does, mentioned in other comments here, any others?


restic is also great


Restic is nice because it has Windows support too.


In conjunction with Borg, look at InterServer[1] for a VPS.

They provide some of the best cost-per-GiB storage. I've been using them for 3 years, and I prepay a year in advance. I've been super happy. They're cheap; maybe they'll suffer data loss, but I practice the 3-2-1 backup strategy[2]; if they lose data, I have a local backup; if by some weird chance both InterServer goes down and my house burns down at the same time, I've got another backup drive at my parents' house.

If you're more risk-averse, there's rsync.net[3], but it is substantially more expensive. However, it has really good data backup practices on its side.

[1] https://www.interserver.net/storage/

[2] https://www.backblaze.com/blog/the-3-2-1-backup-strategy/

[3] https://rsync.net/


> if they lose data, I have a local backup

1. How can you tell if they’ve lost data that requires pulling from the local backup?

2. What happens if they silently corrupt the data?



Organize your "keepers" into an archive you can maintain. At ~6TB (edited "gb" typo, tx) that's not a big challenge to back up; multiple USB hard drives and regular scheduled backups of that will serve. "Remote" is a distraction; put your data on media you own and keep a copy in a bank box or something if you feel the need for more off-site redundancy.

Collecting and arranging the archive is the big job. You're just gonna have to bite down and start doing that yourself: no one else is likely to know what needs saving or not. Set up a NAS or file share with a big HDD and start collecting files there.

You may find the old stuff fun to scrape up; for example, how difficult is it to find a PATA interface today? That problem only gets harder. Motivation to get on the job now, rather than later, and to make "archive maintenance" more of an everyday task than to let it pile up.


I'm a fan of the "low tech" redundant backup, that is, making an independent copy of the data (for example once a year) on a new HDD and putting it in your parents' attic.

It's not exactly matching OP's requirements, but my solution is to regularly rsync important stuff to a small home server, on that homeserver do a borg-backup of the backup folder once a week onto a second drive. First one is ssd, second spinning rust.

In OP's case I think the biggest problem is sifting through all the data. I did this a few years ago with only 4 or 5 old drives and a dozen DVDs and it was exhausting, but also occasionally pretty nostalgic when I found photos from college or old programs I wrote as a teen.


OP wants to store 6 TiB, not 6gb.


Now it's 23TiB... even more complicated


thank you


If you can make your dataset fit on a single 24tb disk (or less if you want cheaper disks), that simplifies the setup. I’d use a set of 3 copies:

* One main disk that you use for collecting and organizing all of the data

* One backup disk in the system with the main disk

* One backup disk kept off-site in a safe deposit box or similar

Every few months, swap the onsite and offsite backup disks to keep the offsite one fresh. When you do that, verify the integrity of the data on the disk that just came back.

Automate verifying the integrity of the main and onsite backup monthly.

My preference for filesystem would be single-disk ZFS “arrays”. That allows you to run a scrub to verify integrity easily.

For copying the data, use either zfs send or rsync. zfs send copies the filesystem directly so it preserves everything about the file (for example sparse files) but rsync is less complex.

For getting the data from your various disks, I tend to use dd to create a raw image of the disk, then use other tools like losetup to mount the image as a loop device. Then if there’s data that I want more convenient access to, I’ll rsync it out of the mounted FS. That way, you’ll never lose data from a misconfigured/faulty copy since you can always do it again from the raw disk image.
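The dd-then-loop-mount workflow looks roughly like this. In the demo below a regular file stands in for the source device (in practice SRC would be something like /dev/sdb, and the loop mount needs root):

```shell
set -e
# Stand-in for a real old disk; in practice: SRC=/dev/sdb
SRC=old-drive.bin
dd if=/dev/urandom of="$SRC" bs=1M count=4 2>/dev/null   # fake "disk" for the demo

# Image the whole device, continuing past read errors (noerror)
# and padding unreadable blocks so offsets stay aligned (sync)
dd if="$SRC" of=disk.img bs=1M conv=noerror,sync 2>/dev/null

# Record a checksum so the image can be re-verified years later
sha256sum disk.img > disk.img.sha256

cmp "$SRC" disk.img && echo "image matches source"
# Later, to browse the image's filesystems (root required):
#   losetup --find --show --read-only --partscan disk.img
```

Because the raw image is immutable and checksummed, any later rsync out of it can be redone from scratch if a copy turns out to be misconfigured.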


I'd recommend storing it in AWS S3 with an encrypted key physically stamped onto metal cards (some crypto people use similar products for seed phrases).

Distribute the metal cards as widely as you can, at home, potentially in safety deposit boxes or buried somewhere (potentially with an additional layer of protection from a brain memory password.)

This protects against a class of physical data destruction from house fires, theft, floods etc.

However, I'd withdraw this recommendation if you live in a jurisdiction where you can be compelled to hand over passwords, e.g. England.


good point. also need to be careful of countries where the federales just ransack banks and take all the safety deposit boxes, too: https://www.jurist.org/news/2024/01/us-federal-appeals-court...


A NAS built from generic x86 hardware and some disks. Use ZFS, it's a bit of a rabbit hole but an excellent choice for both reliability/redundancy and as a tool to backup. I'm not gonna explain how this works, just describe what you can do, it's an option.

ZFS:

- ensures you don't get bit rot

- manages both disks (raid/mirrors &c) and the filesystem, it's an all-in-one solution

- supports block level replication to local and remote systems, after the first backup it's fast

- can create dynamic partitions to group files together and build replication strategies around

Choose either RAID or mirrored drives (https://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs...) I've gone mirrored but more for flexibility and performance. Use a calculator to see what options of disks you have https://jro.io/capacity/ (and google 'ZFS calculator' for others)

For backup get a second machine somewhere else in your house with a smaller setup and use ZFS replication to keep it up to date with everything on the main box you need backed up. Currently I use a raspberry pi with a USB disk but this is perhaps cutting it fine. You wanna keep this online so ZFS can periodically check the health of the data on the disks. Fully offline backups can be a risk.

Finally for a 3rd backup use some of those external drives, format to ZFS and use replication. Plug them in on a schedule and take a backup.

If you want to backup to remote systems (cloud/a box in your parents house) it also supports filesystem encryption. With the right options you can stream incremental backups over SSH only passing encrypted blocks. The system at the other end never needs to see the raw data.


I think this is spot on, and the only thing I'd add is to use something that can take ECC RAM. Rather unfortunately, that probably means server hardware which doesn't always mix great with consumer settings. I'm currently thinking about a Dell Precision workstation for this as it seems like a decent blend of server hardware in a nice, quiet package.


Do you have any nerd friends?

I just "trade" NAS space with a few of my friends for remote backups. They allocate me a few TBs on their servers and I allocate them a few TBs on mine, and we all give each other some locked down shells for remote access (bubblewrap, jail, VM, etc. everyone has a different approach).

I'd say grab a few 8-16TB disks, throw them in a raid 5/6 with a filesystem that supports snapshots and compression, dump all of your 23TB of data into it, come to a trade agreement with a couple of friends, and sync your 6TB of important data to your friends' servers (gifting them a drive >= to your storage needs can grease the wheels here).

The details don't really matter as much as the overall architecture. I personally have something like 12 drives haphazardly shoved into an old desktop case with a couple pcie sata expansion cards for extra ports, all dm-crypt'd and gathered into a btrfs raid6. I regularly rsync my other systems to the backup server and take a snapshot after each sync, so I have incremental historical backups for every live system. Backups of retired systems are just static and never get touched. The really important stuff gets synced to friends' servers using restic (encrypted, incremental).

You can make the whole setup very flexible and reliable at a very low monthly cost if you just work with what you have and get your hands a little dirty instead of relying on commercial services.


Backblaze is $6/TB so should be cost effective for this amount of data.

I think all the recommended backup SW (restic, duplicity) encrypts before storing. I use restic but haven't exactly been hugely exercising it, so I can't place much weight on my experience. But it should be okay with this quantity of data.

Generally I'd guess that diversity of backups wins over using a single expensive one in terms of reliability, but at the expense of more overhead managing them.


I've had great luck using rclone to encrypt and send to backblaze b2 for our offsite backups.

See: - https://rclone.org/crypt/ - https://rclone.org/b2/
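For reference, the rclone side of that setup is just two remotes in rclone.conf: a b2 remote and a crypt remote layered on top of it. Bucket name and credentials below are placeholders, and the passwords are stored obscured (generated via `rclone config` or `rclone obscure`):

```ini
[b2]
type = b2
account = YOUR_KEY_ID
key = YOUR_APPLICATION_KEY

[b2-crypt]
type = crypt
remote = b2:your-bucket/backups
filename_encryption = standard
directory_name_encryption = true
password = OBSCURED_PASSWORD
```

Then `rclone sync /data b2-crypt:` uploads with filenames and contents encrypted client-side, which also addresses the filename-leak concern raised upthread.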


Rclone is great, but I'd be wary about using a non-backup tool to do backups. You want something that makes it really clear when you are going to overwrite: if a file has changed, rclone is going to push the new version, whereas a backup tool would retain a copy of the old version. If the file was corrupted, rclone just corrupted the backup too.


Interesting. Does the rclone versioning not work as described? I fortunately have never had a reason to test it.

> When rclone uploads a new version of a file it creates a new version of it. Likewise when you delete a file, the old version will be marked hidden and still be available. Conversely, you may opt in to a "hard delete" of files with the --b2-hard-delete flag which would permanently remove the file instead of hiding it.

https://rclone.org/b2/#versions


I didn't know you were using that - that's actually a feature of B2 (and S3). It works as far as I know (it does in S3). Assuming you switched it on on the bucket (not the default in S3! don't know about B2).

The limitation is that it's going to be somewhat more tricky to retrieve a consistent snapshot, because each file is versioned individually - rather like using CVS instead of git.

However, S3/B2 versions are still a nice feature because they offer a simple way to get "non-fate-sharing", which is a great property to have in backups but is otherwise difficult to arrange without multiple machines. Essentially, two systems fate-share if someone who hacks one can delete both. So ideally the machine being backed up should not be able to delete the backups, and the machine (or service) hosting the backups should not be able to delete the original.

Dumb storage does not have this property (if you keep it online), and using an online service as dumb storage doesn't either. But if you use a credential which isn't allowed to delete versions in your backup tool, enable versions, and keep your Backblaze admin credential elsewhere, then you have non-fate-sharing. So it's a good idea to enable versioning even if you use restic, which does its own versioning (and I don't think this will eat more space, because restic names files by content).
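Concretely, in S3 terms (B2 application keys have analogous capability restrictions), the backup client gets a policy that can write and read objects but lacks the version-delete permission, so a compromised client can only add recoverable delete markers, never destroy history. A sketch with a placeholder bucket name:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-backup-bucket",
        "arn:aws:s3:::my-backup-bucket/*"
      ]
    }
  ]
}
```

The point is what's absent: without s3:DeleteObjectVersion (or lifecycle-configuration permissions), this key cannot purge old versions; only the separately stored admin credential can.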

Another limitation of using rclone as a backup tool is that different rclone backends support different metadata (because the underlying systems do). So, it's not suitable to back up the whole of your OS, or something where you care about not losing the metadata. Eg, the b2 backend doesn't support metadata, so if you back up your OS using it you won't know what file was owned by who.


Which one does ?


Both the ones I cited above do.


I’d keep it as simple as possible, so that people who are not you can access it if needed when something happens.

My current combination is just Backblaze + TimeMachine for local backup. I also mirror it to a Synology with CarbonCopyCloner.

All the encrypted, cli tools I have used in the past I abandoned again as it was always annoying to maintain and monitor and impossible to explain to anyone.


How do you clone via CCC? I’ve an identical setup with TM and Backblaze but haven’t setup a sync to my Synology yet. As I use iCloud, I’d like an additional sync to the NAS.

Do you just run a periodic task to sync to the SMB folder?


Yep, I have two tasks in CCC that mount a directory via SMB. One for my machine that runs once a day, and one that's triggered if I attach an external HDD and just mirrors that one to the Synology.


First time hearing about backblaze

The pricing looks too good to be true. What’s the catch?


There are two main catches with Backblaze:

1. They will delete all backups of your external hard drive if you do not plug them in and fully sync them for at least ~12 hours every 30 days. You can pay an extra fee each month to have your external drives stored longer to mitigate this problem [1]. The fee is per GB so it does materially change the Backblaze pricing if you have a lot of external data.

2. Their client is extremely extremely slow, even with performance settings set to their recommendations it just obliterates CPU (irrespective of the size of the actual changeset being backed up). I've always been able to feel my computers slow way down when the Backblaze client spins up. [This issue may be resolved on Apple Silicon, but it's been a nuisance on every Intel Mac I've ever had.]

There are some other weird quirks in their macOS client, like you can set your backup schedule to "Once Per Day" overnight and it will still, multiple times per day, just start running unexpectedly and then you have to deal with the slowness or hit "Pause Backup". Somehow this happens 2–3 times every single day. I've tweaked the settings, reinstalled the client, and all of the things, but this issue has persisted for 5+ years on every machine I've used their software on. Super annoying.

Sidebar: They have a native mobile app but it is really really bad. Slow, clunky, somehow always getting logged out. The mobile client sucks so bad that I usually just wait until I'm at my desktop to do a restore.

[1]: https://www.backblaze.com/blog/backblaze-7-0-version-history....


> There are two main catches with Backblaze

Some would say there is a third. Decryption during restore is handled on the server, not on the client. They plan to support doing all encryption/decryption client side, but if that is something you need now Backblaze's client backup service may not be the right choice.

Here's a comment with more details and references [1].

[1] https://news.ycombinator.com/item?id=38862215


On the pro side: BackBlaze has always focused on making their tech lean and agile enough to stay profitable while they hit that low pricepoint.

Most users don’t use much HD space for their backups, so they can absorb the extra cost for people who back up large amounts, and since they are using their own lean-focused servers with consumer drives, their cost per TB is much lower than a company backed by S3.

They are also the company that puts out drive stats every quarter.

And they open-source the hardware for their rack-mount “pods”.

They also added another revenue stream: B2. It’s an S3-compatible storage service with a decent profit margin that still manages to be 1/4 the cost of S3.

TL;DR: They have a solid business model built around providing that low price.


Tape drives are a viable option. A 6TB LTO-7 (M8) cartridge can be had for $50.

The drive is expensive, at $4k, but it’s still the perfect solution for long-term archival, because the cartridges weigh and cost damn near nothing.

Yes, there’s large upfront cost, but after that the cost per MB will slowly approach the rock bottom price of how much the cartridge costs.


LTO-6 drives can be had for a lot cheaper used. I have two external ones now. The tapes of course aren't as big but they're big enough to still be pretty useful at 2.5TB.

My LTO-6 drives cost about $250 each. They annoyingly use SAS instead of something like USB or Thunderbolt, but I already had a rack mount server to install a SAS card in.


This is all contingent on four things you haven't told us: cost, how often you expect to access it, your threat models and risk tolerance for each of those threats, and what infrastructure you have access to (aka how safe is your home, can you store a backup safely with your parents, can you store drives in a lockable private office at work safely?)

For example, if your house burns down, it doesn't matter if you have 10 mirrored copies of the same 8 TiB drive in the same room. If your parents get a new housekeeper who goes on a cleaning spree, it doesn't matter if you have 10 mirrored copies of the same 8 TiB drive in the same shoebox.


Cost: Given a quick look at the cost of archival cloud storage, I guess I would be willing to spend up to $60 per month on a remote copy.

How often: I would expect to access the remote files rarely (maybe once a month), and need a complete recovery very rarely (with a low requirement for recovery speed). Local backups I would like to occur at least weekly, with a verification or access frequently (daily?).

Risk tolerance: For the local-encryption-for-remote-storage aspect, I would like something with a high level of confidence in the cryptography and the implementation surrounding it. I would also like a high degree of confidence that I can recover my files in case of a natural disaster or similar that wipes out my local copies.

Local security: I live in a relatively secure home in a relatively low crime area. I could store a copy at a relative's house, although I may move far enough away soon that it would not be practical to access.


You should separate your long-term and near-term storage and treat them as different backup projects. For the near-term that you are backing up weekly and thus likely need to recover faster and easier in an emergency, borg -> backblaze with your own keys.

For the long-term backups that sound more sentimental, the "put lots of 6 TiB HDDs from different manufacturers in safe places" approach becomes more feasible when you don't have to update it weekly and when it is OK to wait a few days so you can drive to a relative's house after a disaster. But you also have to test those drives every now and then.

I have a bunch of long-term drives at my parents' house a 5 hour flight away, but I visit them every winter and do a test every year. If I needed those today, it might cost me $1000 to drop everything and get on a plane. But if I'm visiting every winter and it isn't really urgent, I can wait. You should think of those kinds of costs too.


The main problem is that you don't have the time and energy to go through 23TB of files. You want a simple solution to pack them all in a safe place and sort them later (which will never happen, but it's ok, we all have this problem).

Here's what to do:

1) Get a few very large HDDs, the largest you deem affordable. In total, you should have about 4x - 8x the capacity of your current files.

2) Split them in two groups: main storage and backup storage.

3) On each group, create a btrfs RAID1 and mount it with compression. Start with a btrfs raid1c3 if you can afford 3 drives, and downgrade to raid1 if you run out of space.

4) Copy all your files there. You may want to make a directory for each media you copy data from. You should extract all archives that you may already have created and let btrfs deal with compression - this allows for deduplication.

5) Run a deduplication tool that supports btrfs (https://github.com/Zygo/bees). Creating btrfs with sha256 checksums is probably better for block deduplication while using crypto acceleration available on most systems.

Safety rules:

- Only connect the backup drives when you do the backup. Keep the backup in the closet or, preferably, in another building (in case of fire, etc.).

- Label the drives. You don't want to accidentally mix the two groups.

Later, when you run out of space:

- buy a pair of even bigger drives and add them to the btrfs volume

- remove the old drives from the volume (this step is very time consuming)

or

- have a huge case or hdd rack and keep adding drives without removing any

One last thing: you may want to use a NAS or DAS, at least for the backup group.


Heh, I'm in a similar, though somewhat milder situation. Always one hard drive failure away from losing some (albeit non-essential) files.

As a first low-budget bandaid I got a single 16TB enterprise HDD (Toshiba MG08ACA16TE) in an external dock that is plugged in to a Prodesk 400 G6.

I regularly snapshot and sync all my btrfs drives with btrbk (all my linux devices and several of my external drives are btrfs). The Toshiba is being deduped with bees to not make this too inefficient.

For important files I have syncthing folders that continuously synchronize between my devices, as well as above solution.

For some dead media I use blurays. Still looking for a solution for Windows, though there I at least have OneDrive.

It's a mess to be honest.

Eventually I want to have big enough RAID for file centralization, restic cloud backups, network shares and so on.

A crucial piece I'm still missing is a synchronization, hosting and backup classification/policy for the various file types I have... This may well be the more difficult thing compared to just getting any redundancy.


Deletion is also a big thing. Identify and delete old, worthless stuff to make the issue more manageable. On my Windows machine I use WinDirStat to identify big things and decide what could do with being deleted.
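Alongside a size-first view like WinDirStat, exact duplicates can be flagged by hashing. This sketch groups files by size first so only same-sized candidates get read in full; dedicated tools like fdupes or rmlint do this faster, but the idea is the same:

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(root: str) -> list:
    """Return groups of byte-identical files under root."""
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # unique size -> cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for block in iter(lambda: f.read(1 << 20), b""):
                    h.update(block)
            by_hash[h.hexdigest()].append(path)
        duplicates.extend(g for g in by_hash.values() if len(g) > 1)
    return duplicates
```

Pointing `find_duplicates` at an aggregation directory (e.g. the NAS share where the 40 drives get copied) yields lists of identical files, from which all but one copy per group can be deleted.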


Dock recommendation: https://www.aliexpress.com/item/33045573142.html

Make sure you buy the USB3 variant. There are lots of USB2 that look exactly the same. Some sellers may scam you and send the wrong one.

It is power hungry even when not in use, the HDD slots are not hot-swappable (you need to turn it off), and the card reader is crap, but it is very convenient for swapping HDDs quickly and it can read 1 SATA and 1 PATA drive at the same time.

There's an alternative from "Orico" that I can't find atm, which supports UAS, but no PATA.

There are lots of usb to sata + pata 40pin + pata 44pin adapter cables. I can't say how good those are, but desktop drives would need 12V power which USB doesn't provide and I would hate to have another power adapter and cable on my desk.


You didn't mention a budget (monthly or tolerance for one-time expenses).

I just copy everything to HDDs using a Plugable USB-C/SATA dock (nicer and far more reliable IMO than those $9 dongles you see around). I then put the drives in a Turtle HDD case for padding against environmental factors. That protects me against everything except house fire/theft/tornado sucking up the case and dropping it in another state.

My backup needs are beyond any single drive, but at 23 TiB you don't have to purge much to fit on a single hard drive...there are 20 and 22 TB models for sale.

I'd buy a drive WAY larger than 6 TiB just in case you underestimated how much you actually want to save. Having the extra space would also allow you to incorporate error-correction techniques like generating PAR2 files (I did that with some emotionally-important personal files).


Except that single drive becomes a single point of failure. I wouldn’t recommend doing this if you care about your data.

I did this once with a buffalo raid nas. It died and I was left with 10 years of my life inaccessible. Much sadness ensued.


Once you consolidate, you can buy a second one/third one/etc to make copies.


The simplest solution is the best when it comes to backups (and almost everything). Since you don't need always-online access, we can eliminate cloud and NAS. An 8TB portable drive is enough as your backup; with more elimination and compression, you can likely get your storage needs down further. Use another one for an offsite backup, kept at a different location like your office or a friend's. Use software like Restic or Borg to encrypt and maintain snapshots of your backup. You can find systemd scripts that automate all of this. Lastly, periodically test that your backup works as expected.
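The systemd automation mentioned here is typically a oneshot service plus a timer. A minimal sketch for restic — the repository path, password file, and retention policy below are placeholder choices, not anything OP specified:

```ini
# /etc/systemd/system/restic-backup.service
[Unit]
Description=Restic backup of /home

[Service]
Type=oneshot
Environment=RESTIC_REPOSITORY=/mnt/backup/restic-repo
Environment=RESTIC_PASSWORD_FILE=/etc/restic/password
ExecStart=/usr/bin/restic backup /home
ExecStartPost=/usr/bin/restic forget --keep-daily 7 --keep-weekly 5 --prune

# /etc/systemd/system/restic-backup.timer
[Unit]
Description=Run restic backup daily

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```

Enable with `systemctl enable --now restic-backup.timer`; `Persistent=true` makes missed runs fire at the next boot, which matters for a drive that isn't always plugged in.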

This is what I do basically, but additionally I maintain the current year of photos on Google Photos too and use Drive for personal document storage, so that it's like a live 24x7 cache of recent data. This has worked for ~10 years.


Be lazy like me. $400 will get you 2x16GB drives and a 2-bay Synology NAS. Tell it to mirror the drives to each other and keep history. The defaults are fine. Then mirror the most important files with its cloud sync function (GDrive, etc). This is in case your house burns down. I know you dislike "online" but the fact is cloud datacenters have pretty good redundancy. Plug external drives into the USB ports for unimportant and replaceable files. Meditate with the idea that you don't actually need to organize anything since the NAS will index it all so you can search. Call it a day.


That sounds pretty cheap (unless you really meant 2x16GB ;)). Got a link to the products?


Whoops TB. It was a while ago, but what I did was set a "deal alert" on slickdeals for "NAS" and "16TB shuckable". I think it's a Synology 218+. I vaguely remember it was about $200+$100+$100. I had to "shuck" the drives (meaning throw away the case). For some reason they're cheaper in an external USB case, and the internet calls them "shuckable" if they secretly contain high end NAS-friendly models. I got 4 and shucked 2 into the NAS bays and plugged 2 into the USB ports (the USB ones are for like movies and easily replaced downloads, so I can access over the network but theres no mirroring - if they die they die)


So you don't create another 40+ disks in the future, get some kind of network attached storage (or an always-on computer) and use it as a backup and media store for your everyday computers. Hopefully it will outlast several computers (I have had the same Drobo with 2 or 3 disks in RAID since 2014, and had 1 disk fail).

Back it up with something one of the other commenters suggests - I actually use rdist to space on a friend's computer (and vice versa); that's worked unattended for over 15 years (and it's free!)


If you care about your data, use ZFS. 3 18TB drives in RAID-Z1 should do, in any random computer with 3 drive bays.

You can set up zpool scrub every week and email yourself the results to check the files are surviving


I use borg backup, one copy goes to a local backup disk (the cheapest brand-name 4tb USB external drive I could find) for quick recoveries, another goes to amazon glacier for catastrophic situations. Cost is something like $3 a month for just shy of 4TB of data, photos and videos that I've taken myself being a large part of that.

I would love to know if anyone considers printed albums, whether of photos, textual data or some other elaborate system involving QR codes or what have you, to be part of their strategy.


For local storage, an alternative to the Synology that others have mentioned is UnRAID.

* Consolidate your drives to empty the largest ones you can. Or, buy three 20TB drives (two for data, one for parity).

* Buy/build a small server, install UnRAID on it, and create an array. If using new drives, you will have 40TB available.

* Copy each old drive's data to the array. Retain the old drives as backup.

* Once done, a) set up remote backup, and b) build a second, identical UnRAID server as another backup. Or the tape backup also suggested.


Why isn't anyone suggesting S3 glacier deep archive? Looks like about $6 per month for 6TB. Sure it'll cost $540 to pull it out, but that should be a one time thing...


Yeah I keep Glacier as an extra redundancy step to have another copy of my important data on a different continent.

Glacier shouldn't even be your redundancy solution, it should be the redundancy of your redundancy :)

Like you're alluding, if you actually need to pull the data out of glacier, you've probably screwed up in the design/testing of your primary backup systems, but we're all fallible so it's an insurance policy.


Why go glacier when you can go B2 for the same $6/mo with no egress fees[0]

[0] When you pay for 6TB you get 18TB free egress, making it functionally free, since you'd need to do >3 full restores a month to incur a charge


WinDirStat (or Linux equivalent) and delete the stuff you don't need. You can run it across multiple drives as long as they're all (accessible) on the same machine.


Wiztree on Windows is faster than Windirstat


I would look into git-annex, which I have been heavily using to do the same thing with almost exactly the same number of files.

It's very simple to have a "dumb" drive that simply stores the files and use annex to remember which drive they are on. It also tracks when you only have a file in one place, in case you want more copies. It can push files to S3, Glacier, and some other backup repos if needed, with client-side encryption (choose between symmetric or GPG)
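A minimal sketch of that workflow - the repo path, drive description, and file name are hypothetical:

```shell
# On the archive drive: make it an annex and give it a recognizable name
cd /mnt/archive-drive/photos
git init && git annex init "archive-drive-01"

# Check files in; annex stores the content and remembers where each copy lives
git annex add .
git annex numcopies 2                 # warn before dropping below two copies
git annex whereis 2019/img_0001.cr2   # list which drives hold this file
```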


I've never done that many, but rclone is famous (IMO reliable) and very cross-platform. It also supports an encryption layer it controls on top of generic cloud providers

So I've been enjoying B2 (as an example, you can FUSE-mount it to browse the files without downloading). But for pure backup, Amazon Glacier or Google Cloud Archive is very cheap. If you wanted to be paranoid you could do both separately

I haven't independently audited rclone's encryption layer
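For reference, a sketch of how the crypt layer typically gets used - "b2enc" here is a hypothetical crypt remote wrapping a plain B2 remote, both defined beforehand via `rclone config`:

```shell
# Sync an encrypted copy up; filenames and contents are encrypted client-side
rclone sync ~/photos b2enc:photos --progress

# Browse the remote without downloading everything, via FUSE
rclone mount b2enc:photos /mnt/photos-remote
```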


Google Archive storage is $1.20/TB/mo. I just had a customer recommend it to me (I'm the author of HashBackup). He said he's been using it for over a year and pays around $3/mo for 2.5TB. One gotcha: the minimum storage duration is 365 days, so if you upload 1TB then immediately delete it, you're still going to pay about $14 over the next year. I really dislike "delete penalty" fees, but they're common.
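A quick check of the arithmetic behind that gotcha:

```shell
# 1 TB billed for the full 365-day minimum at $1.20/TB/mo
awk 'BEGIN { printf "%.2f\n", 1 * 1.20 * 12 }'
```

That prints 14.40, i.e. roughly the $14 figure above.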


I use borgbase for my cloud backups. Works great, but not for 23 TiB of data

I used to use external hard drives for backing up large amounts of data, but nearly all of them failed (4 out of 5 of my external hard drives broke; I can't get data off them because of multiple I/O errors, even though RAID says the drives are fine, and some devices fail to show up at all)

So I actually decided to get a Synology NAS and use it exclusively as a backup target.


I decided that nothing is important enough to me to spend any time on this. Other than financial, job, medical, insurance or tax records and the like.


You didn't note any budget requirements.

If money is no object, just buy a synology NAS drive and be done with it. They do everything you want and more, and are incredibly user friendly.

Use their BTRFS filesystem with SHR Disk groups to get multiple disk redundancy alongside data scrubbing for bitrot protection.

It also contains software to connect to any cloud provider for remote backups if that is what you want.

EDIT: Something like a synology DS1621+ would do you well.


For duplicate files I've used the following:

- https://github.com/adrianlopezroche/fdupes

- https://github.com/pkolaczk/fclones

The latter works well for larger datasets; it outputs a TXT file which you can analyse and decide what to do with.
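In the same spirit, the core of what those tools do can be sketched with plain coreutils - the paths and demo files below are invented, and this naive version breaks on filenames containing newlines:

```shell
# Create a tiny demo tree with one duplicate pair
demo=/tmp/dupe-demo
mkdir -p "$demo"
printf 'same\n'      > "$demo/a.txt"
printf 'same\n'      > "$demo/b.txt"
printf 'different\n' > "$demo/c.txt"

# Group files by MD5 and print only lines whose checksum repeats
find "$demo" -type f -exec md5sum {} + | sort | uniq -w32 -D
```

This lists the duplicate pair (a.txt and b.txt) and skips c.txt; real tools like fclones also compare sizes first and hash in parallel.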


The variety of approaches in these comments is fascinating.

Does it mean there's more than one right answer, or that no one solution is ideal?


It is because the "right answer" is contingent on personal factors like cost, risk tolerance, threat models, existing skills, and access to existing infrastructure. The right backup solution is the one that works for you.


No one solution is ideal.

Each has tradeoffs and “lock-in” (which cause people invested in an answer to avoid others)


4-bay QNAP with dual 14TB disks in RAID 0. From there you can encrypt locally and hybrid cloud sync to S3 Glacier.


I'd suggest not RAID 0 - with a 4-bay NAS you have room for a RAID level with some redundancy in the case of hardware failure. S3 Glacier is a good idea to mitigate against data loss, but recovery isn't instant


Ah, no, you're right. I meant RAID 1.

With 4 bays you can definitely fill them up and go for striping plus mirroring to increase capacity while keeping redundancy, but high-TB disks are $$


Buy an external USB enclosure, 1 or 2 bays, with 16 to 22TB drives, and consolidate everything. Use Backblaze for unlimited storage. If you really only have around 6TB of actual data, a 2-bay external enclosure, mirrored, will be sufficient.


Any recs for brands or specific products on 2-bay drive enclosures?


You could make your own.

This is a 4 bay but it'll be configurable for any use case you can come up with. https://blog.briancmoses.com/2023/09/diy-nas-econonas-2023.h...


The datahoarding community often discusses things like this: https://www.reddit.com/r/DataHoarder/


I think you would also benefit from asking on reddit r/datahoarders.


If you can pare things down to fit on a couple of drives that you can keep attached to an online computer, Backblaze would be inexpensive and they have an option to use your own encryption key.


I use Arq and Wasabi for about 25TB of backups. Wasabi is $173/month for the 25TB, though.


Never heard of them, but cool names


I would get a synology NAS.

For offsite, I would use restic to S3 or backblaze.


Why not turn this into a nice tech portfolio demo project? Why don’t you design a global high availability data system on top of Amazon S3? Then, you could also implement native client for each OS that you use. It is a great way to learn.


Maybe before you get into the technical aspects, there's another consideration. Out of that 23TiB, are you able to estimate how much of a problem it would be, if you lost some or all of it? e.g.

* disastrous

* very upsetting

* disappointing

* meh

(It might also help to consider when you last needed to access any of it?)

Because honestly, my bet (without judgement) is that there's likely a significant amount of data in there that simply doesn't warrant keeping. I base this on my own habits (I have to actively fight a hoarding tendency digitally and in real life) and also knowledge of friends who (while otherwise very well adjusted) seem to find digital hoarding easy to fall into - maybe because it has less of a visible life impact than physical belongings.


23 TiB ... I would use 4 disks for that - each 10 TB or 12 TB in size (depending on how much room You want).

In RAID5(3) + SPARE with ZFS that would be 'raidz' mode.

    % math 12000000000000 / 1024 / 1024 / 1024 / 1024
    10.91

    % math 10000000000000 / 1024 / 1024 / 1024 / 1024
    9.09
From 10 TB disks You would have 3 X 9.1 TiB, which means about 27 TiB of space available.

From 12 TB disks You would have 3 X 10.9 TiB, which means about 32.7 TiB of space available.

> 1. I don't need the files to be accessible online, in fact, I would prefer if they were not.

I would keep it on a local LAN w/o Internet access.

> 2. If anything is backed up to the cloud, I want pre-internet-encryption with keys that only I know and control.

Use rclone(1) with its encryption - You can clone these files to S3 in the cloud.

> 3. I want something simple, that could be recovered using a pragmatic approach and open source software in case of a disaster.

I use rsync(1) for forever-incremental backups and rclone(1) to backup some of that into encrypted S3.

My rsync(1) scripts are here (maybe You will find them useful):

- https://github.com/vermaden/scripts/blob/master/rsync-delete...

- https://github.com/vermaden/scripts/blob/master/rsync-delete...

- https://github.com/vermaden/scripts/blob/master/rsync-delete...

- https://github.com/vermaden/scripts/blob/master/rsync-delete...

- https://github.com/vermaden/scripts/blob/master/rsync-delete...

- https://github.com/vermaden/scripts/blob/master/rsync.sh

> 4. I'd like a system where I can easily test my recovery strategy.

Also with rsync(1) ... or anything else for plain dirs/files.

> 1. What local filesystem setup should I use? Number of drives? Local backup approach?

ZFS.

Details about ZFS pool settings here:

- https://vermaden.wordpress.com/2023/04/10/silent-fanless-del...

> 2. If you've done this before, is there a strategy that you used for the actual aggregation of the data?

I am doing something similar for a 5 TB data set. I have 4 data sets. I use 2.5-inch 5 TB drives with ZFS and GELI encryption.

    Main Source ==> Backup @ LAN (LOCAL)
             \ \
              \ > Backup @ Internet (SSH) (REMOTE)
               \
                \> Backup @ USB (OFFLINE)
> Are there any particularly convenient IDE to USB docks?

Many - check ALIEXPRESS.COM for tons of them.

> Any good software that you would recommend for locating duplicate files?

    % cargo install czkawka_gui czkawka_cli
You may also use ZFS deduplication for the datasets you KNOW have duplicated data. There is also a new ZFS feature, Block Cloning - You may want to look into that as well.
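Deduplication is a per-dataset property; a sketch with placeholder pool/dataset names (dedup costs a lot of RAM, so scope it narrowly):

```shell
# Enable dedup only on the dataset known to hold repeated data
zfs set dedup=on tank/os-images
zfs get dedup,compression tank/os-images
```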

> 3. What remote backup software should I use?

I use rsync(1) and rclone(1). The rsync(1) for everything file/dir based. The rclone(1) to put encrypted backups into S3 containers.

Regards, vermaden


Not exactly what you're asking for, but I think worth considering: LTO-6 data tapes.

I have about 29TB of blu-ray rips that I didn't want to risk having to re-rip (that took months!). My solution was to buy an LTO-6 tape drive on eBay, and about 100 tapes.

If you get lucky, a used LTO-6 tape drive will cost you roughly $250-$350 on ebay. The tapes themselves can be had for about $10 each, particularly if you buy a lot at once. Each tape can hold around 2TB [1]. I have all my movies backed up twice, on two tapes each. I have a label maker where I label the tapes from A-Z and I have a spreadsheet keeping track of which movies live on which tape, in case I need to restore just one.

I don't know if there are any kind of proprietary blobs in the kernel required for this, but I was able to get this working on vanilla NixOS with the `sg` kernel module enabled, and the open source LTFS implementation from HP [2].

The tapes are actually a lot faster to read and write than people think, but you can only read and write one file at a time, so you have to plan accordingly. They're also not random-access, so even though LTFS gives you a filesystem mountpoint, you probably don't want to be rsyncing files directly to them. It's not a "RAID", just a regular filesystem so when I run out of tapes, I can simply buy some more.

I keep them in a big plastic storage bin, and I have a ton of desiccant in there to protect against humidity. I haven't lost any tapes yet, and they're rated for like 15-30 years, but I want to hedge my bets a bit and desiccant is not expensive or hard to get.

Still, I am very happy with my setup. It's saved me a lot of time after I broke a RAID configuration and lost all my blu-ray rips for my Jellyfin server.

[1] They advertise like 6.5TB but that's sort of a lie; that's assuming the best-case scenario with their on-board compression. If you're backing up already-compressed stuff like video or photos, you get much closer to the 2.5TB limit, and you don't really want to run them to the edge I think, so I stop after 2TB.

[2] https://buy.hpe.com/us/en/storage/storage-software/storage-d...

ETA:

In regards to "testing", I didn't do anything too elaborate. I filled up a tape with movies, then copied them back, and compared the md5sum of each of the movies to make sure nothing had changed. They hadn't changed so I was happy enough with the results.
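The checksum round trip described above looks roughly like this - the paths and sample "movie" are fabricated for the demo, with a plain cp standing in for the write-to-tape and copy-back steps:

```shell
src=/tmp/ltfs-demo/src
restore=/tmp/ltfs-demo/restore
mkdir -p "$src" "$restore"
printf 'fake movie data\n' > "$src/movie.mkv"

# Stand-in for: copy to the LTFS mount, then copy back off the tape
cp "$src/movie.mkv" "$restore/"

# Record checksums of the originals, then verify the restored copies
(cd "$src" && md5sum movie.mkv) > /tmp/ltfs-demo/manifest.md5
(cd "$restore" && md5sum -c /tmp/ltfs-demo/manifest.md5)
```

The last line prints "movie.mkv: OK" when the restored file matches.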

Also, I forgot to mention, most of the tape decks I've seen are SAS-only, so you'll either need to make sure your computer/server/whatever has a SAS port, or you'll need to find a card that has one. I think the modern LTOs have thunderbolt support, but I haven't used them. I simply found a used PCIe SAS adapter on ebay for $35 with shipping, and plugged that into my server. I think the only things I had to directly install were `mt`, `ltfs`, and enable `sg`.


Ever since watching the robo-tape floor at a company where I did an internship, I have been fascinated by high density tapes. I probably won't do this but I like the suggestion.


I think it's definitely worth considering. I was using Google's cold storage backup for my blu-rays for a while, which was great until I needed to do a total restore from backup and it ended up costing me almost $800! Be careful with cloud computing if it's your wallet on the line.

It was a bit of an upfront investment to buy all the gear, but I can restore from backup as often as I want for free, and my total cost for gear was less than $800 so it was fairly easy to justify.

I would really like to get an LTO-9 drive at some point, but they are way too expensive for me to be able to justify right now, even used. The tapes themselves can be had fairly reasonably, but the drives (particularly the external ones) cost like $4000+, even used. In fact that's true of basically any LTO standard after 6. Once the thunderbolt drives get down to less than a grand I'll probably pick one up, but I suspect they're expensive because they're not really marketed to "consumers", but instead to entities where "cost is not an object".


A RAID-based NAS would be the obvious way to go.

Since you are not bothered about huge data throughput, software-RAID (rather than hardware-RAID) would be the cheaper way to go in general. A lot of the discussion of pros/cons of different RAID-levels that you can find online will give a lot of attention to how it affects the aggregate read/write speed; for a single-user data-archive, this is not hugely important when compared to the basic ratio of usable to redundant disk space.

You can manually set up software-RAID on most linux distros for any filesystem you like, or if you want something that does most of it for you then I can recommend unRAID (https://unraid.net/).

I have an unRAID server with 8x 3TB HDDs and 2x 1TB SSDs, in which the HDDs are in a parity RAID array (I can never remember which RAID-level number that is), meaning I get 18TB of usable space with two disks of redundancy.

The two SSDs then act as a write-cache (in mirrored RAID) so the HDDs don't need to be spun up when you add new data. This makes the whole thing very low power as the HDDs spend 99% of their time spun-down. I think my server uses about 42W on average, and that's with a bunch of web services going on as well.

unRAID provides a lot of useful utilities for managing files, some native and some via plugins. This is things such as Discord/email/Telegram integrations (so your server can notify you when a disk starts to fail) as well as things like file integrity monitoring, fan control, scheduled backup, etc.

LUKS encryption is supported if you want extra security.

Re point 3: if your unRAID OS keels over for whatever reason then the data on the drives is stored in the filesystem of your choice so you are not bound to using unRAID to recover that data.

Re point 4: you can test your system by pulling drives - unRAID should automatically emulate the data on the missing drive while you find a replacement. I have had multiple drives fail (due to a faulty HBA) and have not lost any data at all.

Re point 2: an unRAID server is accessible on your local network, and you can choose to enable SAMBA and/or NFS for different "shares." e.g. you could have your music share accessible read-only by everyone but write-protected to just you and simultaneously have your personal files share only accessible to one user.

What filesystem to use is a whole can of worms. I use XFS and it is thoroughly okay - I'm not enough of a power user for the choice of file system to make a difference to daily life and I suspect this is the case for you too.

If you want more redundancy than the standard parity array offers, then you can set up "pools" in the OS that have different RAID levels.

For your 6TB of data, an array of 4x 3TB HDDs would be a fine start, giving you 9TB of usable space with single-disk redundancy. An SSD cache pool can be added later to lower initial setup costs. With just four to six devices, chances are your motherboard will have enough SATA ports for you to not need any kind of PCI HBA or expander cards. 3TB per disk is a good trade-off point between capacity and the cost of a failed drive, IMO.

You won't need a lot of RAM - 8GB would be plenty if you don't plan on using it for hosting any web-services.

For a processor, look for a good low-power option such as an Intel Xeon E3-1220L. With a canny enough choice of components, you should be able to keep your power consumption well below 30W (while the drives are spun down). If you really only need to access this data very occasionally then there is no reason not to power the server down when not in use.

Chassis choice is also near-infinite. I have a UNAS (https://www.u-nas.com/xcart/cart.php?target=category&categor...) which is lovely, but any old PC chassis will do if you aren't fussy.

A good tip with multi-drive systems in general is to deliberately unbalance your drive-use. If you use all the drives equally, you will wear them all out at the same rate, raising the chances that multiple drives will fail within a short space of time. I tend to separate my drives by use - music on one, films on others, etc. This also keeps power consumption down as you only need to spin up one drive to access one group of files (rather than songs in an album being potentially split over multiple drives).

In terms of strats for performing the actual backup/organisation, I would find a way of mounting the existing drives to the new system one-by-one (such as by using an eSATA PCI card). Using fresh, blank drives for your NAS will mean that you can retain the ability to start again if you mess up or change your mind about what you want to keep as you won't need to modify any of the data on your existing collection while you create the new one. Generally speaking, I would avoid trying to organise the data in-situ.

It is worth re-iterating that you can achieve very similar results with a server running Ubuntu using open-source software RAID drivers or even a cheap second-hand hardware RAID controller PCI card. I am just a big fan of how low-maintenance my unRAID setup is compared to when I used to do it all manually.

It is also worth mentioning that you can get a pretty good off-the-shelf solution for this sort of thing from companies like QNAP and Synology.


rsync


I highly recommend Tarsnap [1] for your off-site backups. Its client does pre-internet encryption with keys only you control, plus deduplication, at a reasonable price; for your use case, about $1500 a month, though more in the first month for bandwidth. I've used them for years.

1: https://www.tarsnap.com/


1500/month is a reasonable price for 6TB of personal data? Not sure I'd agree with that.


They deduplicate and compress before storing, so it may end up only taking 1.5TB.

Still, that's $375/month. Nothing to shake a stick at.

I think the intended use case of tarsnap is for <100 GB raw data that is really important to you, for which you must have an offsite backup that is (according to them) insanely backed up and safe. I could see myself using them for a curated folder of personal documents I can't afford to lose in the event of a house fire. Then again, I could probably fit all my most important documents into my 1Password vault.


If it's encrypted, it's not deduplicable and not compressible


From my understanding of the tarsnap client, it performs the deduplication and then the encryption locally, which is how it achieves high compression. Still, the cost is too high for my exact use case.


At $375 a month you could buy a new drive and have yet another copy every month.


Seems like extortion IMO - surely a VM could be spun up in any Cloud service for a fraction of the cost?


Tarsnap is an order of magnitude more expensive than S3 standard, two orders of magnitude more expensive than S3 Glacier. It's great for secure backups on the order of kilobytes or megabytes (I use it), but not for this.


18k a year is not reasonable for personal backups.


$1500 a month is four or five 18 TB drives every single month.

You could just make four copies of all your data every month, and throw hard drives in all directions.


Only if you have a sysadmin in your family who would be able to access it if something bad happens to you though.


1.5k per month is more than 23tb of hard drives for 10 years ....


1. Doesn't matter. ZFS or whatever. I use ext4, it's good enough.

2. Buy as many 3.5 inch external USB drives as you need to reach 23 TiB, then connect them all over a single USB hub. Buy the same drives again, shuck them, and stuff them in this[1]. Store them in one of these[2] when you're not doing a backup and put it somewhere outside your home. Merge them all using mergerfs.

3. rsync. If you need to think about how precious your data is, at 23 TiB, you've already lost. Just backup everything in triplicate. Don't bother setting up RAID; it's not a backup.

[1] https://www.amazon.com/-/en/dp/B07MQCDVJ2/ [2] https://www.amazon.com/-/en/dp/B087WXFFW6/



