Big Data Junkie - Chip Schweiss
ZFS properties and replication
- Details
- Category: ZFS
- Last Updated on Monday, 23 May 2016 07:50
- Published on Friday, 20 May 2016 09:20
ZFS replication is handled quite well with zfs send/receive. Data snapshots are preserved and data is guaranteed correct via checksums. However, what is not replicated perfectly are the ZFS properties set on every dataset and snapshot.
When a dataset is replicated with the 'zfs send -R -I' options, it is expected that a fully replicated copy is created on the receiving end. All ZFS properties from the sending dataset are sent to the receiving dataset; however, they remain hidden and inactive if any of those properties were set locally on the receiving end.
They can be activated by calling 'zfs inherit -S'. Here is where the problems begin.
'zfs inherit -S' will only operate on one property of one dataset at a time. There is no recursive option and no option to operate on all properties.
If 'zfs inherit -S' is called on a property that was set locally, it will effectively remove the property if no received value was sent for it. Depending on why it was set locally, this could be problematic. There is also no way to examine the property beforehand to know whether a received value exists. It's effectively operate at your own risk.
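To make that concrete, here is a minimal sketch of the manual workaround using a hypothetical dataset name (these are standard zfs commands, not my scripts):

```sh
# Show the active value and its source (local, inherited, default, ...)
# on the replicated copy. Hypothetical dataset name.
zfs get -o name,property,value,source compression tank-dr/projects

# Revert this one property of this one dataset to its received value.
# If nothing was received for it, the local setting is simply cleared.
zfs inherit -S compression tank-dr/projects
```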
Some ZFS feature work is needed. Here's what I would propose:
- Integrate the patch in https://www.illumos.org/issues/2745 so that 'zfs receive' can control which properties are received and override them as needed (see the sketch after this list).
- Add a new option to 'zfs receive' to make all received properties active.
- Add an option to 'zfs get' to expose received properties that are not active.
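Purely as an illustration of the first proposal, receive-side control could look something like this. The '-o' and '-x' receive flags shown here are hypothetical in this context; I am not claiming this is the exact syntax of the illumos 2745 patch:

```sh
# Hypothetical: force a property value on the receiving side and exclude
# another from being received at all, regardless of what the stream carries.
zfs send -R -I tank/projects@monday tank/projects@tuesday | \
    zfs receive -o compression=lz4 -x mountpoint tank-dr/projects
```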
While not perfect, I have come up with ways to deal with this in my replication scripts. After each cycle of replication, the scripts locate all local, default and inherited properties within the dataset, and each of those properties is updated with 'zfs inherit -S'. Since there is no option to exclude or override received properties, I keep an exclusion list of properties that never get 'inherit -S'. This activates all the properties needed and updates them as they change. Each cycle also keeps track, in the system /tmp space, of all the properties that have previously been propagated.
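A simplified sketch of that per-cycle pass follows. The dataset name and exclusion list are hypothetical, and the /tmp bookkeeping of previously propagated properties is omitted:

```sh
#!/usr/bin/env bash
# Minimal sketch of the post-replication property pass (hypothetical names).
DATASET="tank-dr/projects"
EXCLUDE="mountpoint|sharenfs|readonly"   # properties never reverted

# Every property whose active source is local, default or inherited,
# on the replicated dataset and all of its children.
zfs get -H -r -s local,default,inherited -o name,property all "$DATASET" |
while read -r ds prop; do
    # Skip anything on the exclusion list.
    echo "$prop" | grep -Eq "^(${EXCLUDE})$" && continue
    # Revert to the received value, activating it if one was sent.
    zfs inherit -S "$prop" "$ds"
done
```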
There are still problems with this method. The number of 'zfs get' and 'zfs inherit -S' operations stacks up really fast, making it a significant load on the system. My scripts also cannot detect when a user property has been removed on the source, leaving it behind on the receiving end. It could be removed by calling 'inherit -S', but I cannot find a safe way to make that call.
OpenZFS Outsiders
- Details
- Category: ZFS
- Last Updated on Friday, 20 May 2016 08:34
- Published on Friday, 20 May 2016 08:34
I've been using ZFS for about 5 years now, primarily under Illumos. I pay close attention to the developer mailing lists and even attended one of the developer conferences in November 2014. I have some ranting to do.
OpenZFS started a couple of years ago and has held several conferences where developers have showcased the ZFS work they were planning to release to the ZFS community. Many of the new features demonstrated at these conferences are badly needed by many users.
Unfortunately, there is not much help from the reviewers to get those features incorporated. The worst example has to be Saso's persistent L2ARC. It has been in the review cycle for several years now and is in production use at Nexenta, but it is still not adopted by OpenZFS.
It seems that only features that key people in OpenZFS, and Illumos in particular, are interested in ever make it in.
I get that there is risk in new features and that they need to be vetted. However, I think part of the problem is that there is no testing branch of OpenZFS. OmniOS has its LTS, stable and bloody releases. Illumos/OpenZFS needs to adopt a similar approach to get new features tested, rather than relying on a few people to do a thorough code review.
Code reviews, even by lots of seasoned ZFS developers, don't always catch everything. A perfect example of this was the L2ARC bug introduced with a patch to reduce the amount of RAM needed to use L2ARC. That bug made it into production systems, even on the OmniOS LTS branch, and caused pool corruption that could not be repaired. It was fixed many months later, after several pools were reported failing, thanks to Arne Jansen. https://www.illumos.org/issues/6214
There are some great features that have been demonstrated at OpenZFS conferences and submitted for code review. Here's my take on some of the ones I would take great advantage of in my environment:
- Channel Programs - This was demonstrated at the 2013 and 2014 OpenZFS conferences, won best in show in 2014, and then seemed to disappear into Delphix. This advancement allows a tremendous gain in control over snapshots and over ZFS folder creation and deletion. Without it, running lots of ZFS folders, as in hundreds or more, becomes very painful to manage. Channel programs will make managing snapshots across even thousands of folders possible.
- Device removal - There have been repeated discussions since ZFS went public about a mythical feature called block pointer rewrite. The point of block pointer rewrite would be to change the raidz level of a vdev or to remove a vdev from a pool. It has always been pushed off as too complex an endeavor, never to be written. However, there have been successful efforts at device removal without doing block pointer rewrite. This feature rolls a vdev into the remaining vdevs of the pool, effectively removing it. While not as perfect as block pointer rewrite, it is certainly a high-value feature, especially in the case of a raidz pool that accidentally had a single-disk vdev added to it.
- Live migration - NFS seems to be becoming the red-headed stepchild of the data storage family. Everyone seems to use it, but no one wants to work on advancing it. Demonstrated at the 2015 OpenZFS developer conference, this simple patch would allow NFS shares to effectively be portable with zfs send/receive. Again, a patch stuck without any help from the core Illumos developers.
Illumos is not the perfect code base that the firewall of code reviewers act as if they are protecting. There is a continuous stream of patches fixing old, bad code in there. Releasing patches whose impact the core developers don't fully understand has its risks, but often the rewards are high. It simply requires a proper vetting process so the inevitable bugs can be found.
If this firewall around code submission is not broken down, Illumos will become the ZFS bastard child of Sun, and the other distributions will continue to diverge as they grow faster and faster. With Illumos controlling the OpenZFS tree, OpenZFS will crumble with it as FreeBSD and ZoL get ahead of their parent in ZFS evolution.
Everyone needs support
- Details
- Category: ZFS
- Last Updated on Friday, 01 April 2016 09:17
- Published on Thursday, 24 March 2016 08:32
When I started building ZFS systems for production usage, I didn't believe I needed any support contracts. This past year has completely changed that belief for me.
Let me start by giving a big thanks to Dan McDonald at OmniTI for his prompt and diligent response when I have had reason to call on the support contracts we took out a few years ago.
Last August I got hit with https://www.illumos.org/issues/6214 and had a production pool become non-importable. This forced me into switching several filesystems to running off the DR pool while the pool problem was diagnosed.
While Dan cannot claim he solved the underlying cause, he certainly knew the right people to get involved and to help with the best plan of recovery. In this case that meant restoring the entire pool from the DR copy. The primary pool could still be mounted read-only, but because of the nature of the bug, its data was not trusted. Dan was able to contribute to the final fix of the underlying bug thanks to the effort he put into helping restore redundancy in our storage.
This past Sunday I once again had a storage pool panic a system upon import. At this time the exact cause of the corruption is still not known, except that it is triggered by the use of xattrs, which are common when using SMB/CIFS shares. I run Samba, not the Illumos SMB server, which means the bug is not in the Illumos SMB code but in ZFS itself.
This time around the pool can be imported as long as I don't mount the affected filesystem. Mounting it even readonly will trigger the panic. So at this point it's not even accessible as a second copy of the data.
In the meantime, Dan McDonald is building a module that will not panic but only throw an error when it encounters the corruption. This will at least make the bad filesystem readable and give access to a second copy of the data while replication runs to create a third, before the broken filesystem is blown away. There is also concern that the corruption exists on the DR copy and could trip a panic if it were ever reimported. This filesystem is 47 TB, so it will take a while until all this replication is done.
If you're running any significant OmniOS systems and going it alone, consider what downtime or data loss would cost you. OmniTI's support prices are significantly better than application-specific solutions such as Nexenta, or VMware for virtualization. Their ability to support OmniOS at the same level, with better response and dedication, is impeccable. They are well worth their support price.
Now, when those from OmniTI read this, don't think it means you can get more money out of Wash U. We're still a non-profit research organization with limited funds, mostly from NIH grants, which is why we are on this path. :) I just hope I can help by spreading a good word or two about the work that has come out of OmniTI. Keep hiring people like Dan, and certainly never let him get away.
DDN 8460 Proof of Concept
- Details
- Category: JBODs
- Last Updated on Thursday, 22 October 2015 13:36
- Published on Wednesday, 23 September 2015 09:14
This week we received a DDN 8460 on site to test for 90 days.
I'm looking for our next generation of JBOD storage. I'm evaluating DDN and Raid, Inc. Both make 84-disk JBODs. DDN has chosen to send us a POC unit before selling us a JBOD. DataON still does not have anything comparable and seems to be changing their focus to clustered file systems.
The DDN JBOD is constructed very similarly to the DataON JBOD. So much so that I'd wager they share some of the same manufacturing.
My first impression of the DDN JBOD is very positive. Many of the pains of racking the DataON DNS-1660 have been made to work much better on this JBOD.
Mounting the rails was made easy thanks to a two-part design for the front mounts. By attaching the small hanger to the vertical rail, the JBOD's rail can be hung in place, allowing the other screws to be started without holding the weight of the rail. Without this innovation, mounting these heavy rails would be a two-man job.
One detail I did not catch at first was that the rail has a fold-over across the entire bottom to support the JBOD. The track sits slightly recessed so the back of the JBOD can rest on the rail while the track is pushed onto the JBOD. This is great, because the server lift is never perfectly level with the tracks; this detail makes it easy to get the heavy JBOD onto the rails.
Unlike many other vendors, DDN ships the disks packed separately. This added about 10-15 minutes to the installation time, but greatly diminished the risk of damaging the disks or the JBOD in transportation.
The cable management is the best I've seen on any sliding JBOD. It cleanly holds all the cables and routes them to the middle of the cabinet side. This cleans up the cable paths to the servers; however, if you don't have a wide cabinet or access to the sides, pulling the cables through can be a challenge.
While numbers on all the disks in the DataON chassis look nice, keeping them consistent becomes a pain as disks die and move around. DDN keeps it simple: as long as you can control the fault and identity lights, you don't need the disks to be numbered.
RTFM
When I received the shipment, I looked for instructions in the packaging. There were none. It wasn't too difficult to figure out, but these types of assemblies are natural to me; I'm sure many would struggle. It turns out DDN has a really well-written installation guide online, the SS8460 User Guide. It is well worth the read and would have definitely saved me time.
Putting it to Use
The day after I got the system racked up, I had a catastrophe with one of my pools thanks to Illumos 6214. A pool with 100 TB of data would not import; it would, however, import read-only. It was on a system further down the aisle in the data center, so I unracked the DDN SS8460 with all the disks intact and moved it there. I built a raidz2 pool on the 4 TB disks and started cloning the pool. Two days later all the data was cloned and I was back to two working copies of my data. I run a DR system that syncs very frequently, so the other copy was on the DR system, which was promoted to production when the pool went south.
Performance was exactly as expected. The pool I was sending from had ~90 disks in raidz2 across 11 vdevs. The pool in the DDN JBOD had 7 vdevs using all 84 disks. The bottleneck on sending was consistently the source pool, most likely because of fragmentation. I can't complain about 48 hours to clone 100 TB.
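For reference, the recovery boiled down to something like the following, with hypothetical pool and device names; only the first of the seven 12-disk raidz2 vdevs is spelled out:

```sh
# Build the new pool on the DDN disks: 7 raidz2 vdevs of 12 disks each.
# Hypothetical device names throughout.
zpool create ddnpool \
    raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
           c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0
# ...six more "raidz2 <12 disks>" groups belong on the same command line.

# Import the damaged pool read-only and replicate from the most recent
# existing snapshot (a read-only pool cannot take new ones).
zpool import -o readonly=on brokenpool
zfs send -R brokenpool/projects@latest | zfs receive -u -d ddnpool
```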
Besides the sending and receiving, some rsync jobs were also run to compare data. The pool performed exactly as expected. We are in the process of completing a PO to purchase this excellent JBOD for our ever-growing data needs.
I'm sure there will be follow-up posts about DDN as our relationship grows.
Other DDN products
Selling raw storage is not DDN's primary focus. They do really advanced clustered storage systems with very competitive pricing. Had I not been this far down the road with Illumos and ZFS, I'd be very serious about trying out their complete storage systems. If you're in the evaluation stage of purchasing storage systems, I can't emphasize enough that DDN is worth talking to.
Update October 2015:
We've purchased the POC unit from DDN. It has now been moved back to its permanent rack location, so I've had the privilege of racking it three times now.
There was one thing that caught my attention shortly after moving it: the JBOD went to a critical condition after powering on. I didn't notice immediately, and when I did, it was reporting critical while at the same time reporting its temperature out the back, so I took this as saying the temperature was critical. This area of the datacenter runs a bit warmer, so I first dropped a temperature sensor in front of the JBOD and found the air temperature was 73.5°F, well within spec. When examining it with santools it became clear the temperature was not critical but perfectly normal. What was "critical" was that I had plugged in SAS cables that were not yet plugged into a server. This is the first JBOD I've encountered that monitors the condition of the attached SAS cables. Nice!






