Without going into too much detail, the benefit of the hot -> warm -> cold -> frozen progression is that storage can be tuned to best suit the environment's use case while keeping storage costs under control.
Hot/warm storage
Hot/warm storage is where the hot and warm buckets live. It is also the only storage tier where new/incoming data is written. This should be the fastest storage type available to your Splunk systems: Splunk requires a minimum of 800 IOPS for this tier. It makes sense to go with NVMe or SSD drives here if they're available. If you're using spinning drives, you'll generally need RAID 10 to achieve acceptable performance (in my experience, you need an 8-drive array to hit those numbers on all but the fastest 15k SAS drives). SAN storage is perfectly acceptable as long as it can deliver the IOPS Splunk needs (test your storage first; don't just rely on your SAN vendor's numbers). Shared storage via NFS should never be used for hot/warm storage.
That being said, if we use flash-based storage for our hot/warm tier, we may not be able to keep everything on it due to the cost of those drives. This is where understanding our Splunk use case comes in: for example, if our primary use case is a Security Operations Center (SOC), we may find that the vast majority of searches only touch the last week of data. That means we can size the very fast storage for that amount of data, with the understanding that searches over longer time ranges will hit slower storage and won't complete as quickly.
Cold storage
Cold storage is the second storage type defined in Splunk, and the last tier that is actively searchable. Cold storage is generally used for data that must remain searchable without any additional steps (such as PCI DSS's requirement that recent data be immediately available for analysis), but where some degradation in search performance is acceptable. This tier is often used as a trade-off, especially when hot/warm lives on a high-cost/low-capacity storage type, such as an NVMe SSD backend, and keeping all searchable data on that tier would be cost-prohibitive.
In some cases, you may simply have one large pool of storage assigned to Splunk, with no performance difference between hot/warm and cold. Splunk will still roll buckets to cold under the hood, so you'll simply want to point the cold mount points at the same storage as hot/warm.
I haven't been able to find official recommendations on performance requirements for cold storage; in general, unless you're on all-flash storage, cold storage will be slower than what hot/warm storage requires. At a minimum, this storage needs to be fast enough to keep up with bucket roll activity (copying index data from warm to cold) as well as any searches that hit this data.
Since cold storage is a lower tier, it's tempting to use network-based storage like NFS for it (NFS is *not* supported for hot/warm storage, but it is for cold). Be aware that this exposes your environment to stability issues when NFS misbehaves: I have seen entire indexers hang due to NFS cold storage access issues (leading to cascading Splunk infrastructure failures).
Frozen storage
The third type of storage is significantly less convenient than hot/warm or cold storage, and can be used as a cost-control measure when longer retention periods are required for regulatory or compliance reasons.
By default, Splunk does not use frozen storage: the default freeze behavior is to delete data once the configured retention period has expired. Splunk administrators can override this behavior and specify where data should be written when it rolls to frozen.
The advantage of frozen data is that it takes up much less space than other data indexed by Splunk, typically around 15% of the original data size. If that number sounds familiar, it should: it's the same 15% we used in our first calculation to determine how much space the raw data takes up once indexed. Rolling to frozen didn't magically give us more space: we dropped the searchable metadata (35% of the original data size), making the data unsearchable in Splunk, and archived only the compressed raw data.
For frozen data to be searchable in Splunk again, it must be thawed. This means moving the data to a location where the indexer can read it and having Splunk rebuild the metadata needed to make it searchable. Depending on the amount of data, this can be a very time-consuming (and manual) process.
Once Splunk freezes data, it no longer tracks it. A process outside of Splunk should be used to clean up frozen data (for example, deleting it after its retention period has expired; a sketch of such a cleanup job follows below).
Frozen data is therefore best suited as an archive that isn't expected to be accessed during normal business operations but must be retained for legal or compliance reasons. Regularly thawing frozen data back into Splunk is not a productive use of an administrator's time.
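Since Splunk won't clean up frozen data on its own, a small scheduled job outside of Splunk can handle it. Here's a minimal Python sketch; the archive path, retention value, and per-index db_* directory layout are assumptions for illustration, so verify them against your own coldToFrozenDir setup before using anything like this.

```python
#!/usr/bin/env python3
"""Remove frozen Splunk bucket directories older than the required retention.

Sketch only: FROZEN_DIR, RETENTION_DAYS, and the directory layout are
assumptions; verify them against your environment and keep DRY_RUN = True
until you trust the output.
"""
import shutil
import time
from pathlib import Path

FROZEN_DIR = Path("/opt/splunk-frozen")  # hypothetical archive location
RETENTION_DAYS = 365                     # total retention requirement
DRY_RUN = True                           # flip to False to actually delete

cutoff = time.time() - RETENTION_DAYS * 86400

# Assumes one subdirectory per index, each containing archived bucket
# directories (db_*). Using mtime as a rough proxy for bucket age; parsing
# the epoch timestamps embedded in the bucket directory names would be
# more precise.
for bucket in FROZEN_DIR.glob("*/db_*"):
    if bucket.is_dir() and bucket.stat().st_mtime < cutoff:
        print(f"{'Would remove' if DRY_RUN else 'Removing'} {bucket}")
        if not DRY_RUN:
            shutil.rmtree(bucket)
```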
Data model acceleration
An often-overlooked component of storage calculations is the space required to store data model acceleration data, which is typically driven by Splunk Enterprise Security (ES). The official guidance is to allow 3.4 times your daily ingest to store one year of accelerated data model data. This means that one year of data model acceleration with a 10 GB/day license takes approximately 34 GB of disk space. That's not a huge consideration for small deployments, but it adds up as the environment scales.
Examples
Now that we've covered the storage classes, let's look at some examples, starting with a small environment of 10 GB/day and then expanding to the example environment described in part 1 of the Splunking Responsibly blog (500 GB/day).
Storage calculation: single instance, 10 GB/day
Let's start with a simple example:
- Splunk license capacity: 10 GB/day
- Single server, no replication
With this capacity, we expect to consume 5 GB of actual disk space per day:
- 10 GB/day * 0.15 (raw data) = 1.5 GB/day of disk usage
- 10 GB/day * 0.35 (searchable metadata) = 3.5 GB/day of disk usage
- 1.5 GB + 3.5 GB = 5 GB/day of disk usage
Now that we know how much space we need per day, we can use our retention requirements to extrapolate our storage needs. Assuming we keep everything in hot/warm or cold storage, the calculation is simple (we'll also script this out after the example below):
- daily disk usage * days of retention
In our 10 GB/day example:
90 day retention:
- 5 GB daily usage * 90 days = 450 GB of disk
1 year retention:
- 5 GB daily usage * 365 days = 1825 GB of disk
Data model acceleration (assuming 1 year):
- 10 GB * 3.4 = 34 GB
1 year retention + DMA:
- 1825 GB + 34 GB = 1859 GB
Storage Suggestions:
In this example, I recommend that the customer allocate 2TB of space for Splunk data.
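To make these numbers easy to re-run for other license sizes, here's a minimal Python sketch of the same arithmetic. The 0.15, 0.35, and 3.4 ratios are the ones used above; the function names are purely illustrative.

```python
RAW_RATIO = 0.15    # compressed raw data, ~15% of the original data size
META_RATIO = 0.35   # searchable metadata, ~35% of the original data size
DMA_RATIO = 3.4     # one year of data model acceleration per GB/day ingested

def daily_disk_gb(license_gb_per_day: float) -> float:
    """Disk consumed per day of indexed data (raw data + metadata)."""
    return license_gb_per_day * (RAW_RATIO + META_RATIO)

def searchable_storage_gb(license_gb_per_day: float, retention_days: int) -> float:
    """Hot/warm/cold storage needed for the given retention period."""
    return daily_disk_gb(license_gb_per_day) * retention_days

def dma_storage_gb(license_gb_per_day: float) -> float:
    """Approximate storage for one year of data model acceleration."""
    return license_gb_per_day * DMA_RATIO

print(searchable_storage_gb(10, 90))                        # 450.0 GB
print(searchable_storage_gb(10, 365))                       # 1825.0 GB
print(searchable_storage_gb(10, 365) + dma_storage_gb(10))  # 1859.0 GB
```

Plugging in your own license size and retention period gives you the same whiteboard math in a couple of seconds.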
If we want to reduce storage requirements by using frozen data, but need to keep 90 days of immediately searchable data and 1 year of data retained overall (per PCI DSS 10.7), our calculation looks like this:
Hot/cold storage:
- 5 GB daily usage * 90 days = 450 GB
Frozen storage:
- 1.5 GB of daily usage * 275 days = 412.5 GB
Total:
- 450 GB (hot/cold) + 412.5 GB (frozen) + 34 GB (DMA) = 896.5 GB
Storage Suggestions:
- In this example, I recommend that the customer allocate 1TB of space for Splunk data.
As you can see, using roll-to-frozen to archive 9 months of data significantly reduces storage requirements, at the cost of that data no longer being readily searchable.
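The frozen split can be expressed the same way. Here's a short, self-contained version of the calculation above (same ratios as before; frozen retains only the ~15% raw data portion):

```python
RAW_RATIO, META_RATIO, DMA_RATIO = 0.15, 0.35, 3.4
lic = 10  # GB/day license

hot_cold_gb = lic * (RAW_RATIO + META_RATIO) * 90  # 450.0 GB, fully searchable
frozen_gb = lic * RAW_RATIO * 275                  # 412.5 GB, raw data only
dma_gb = lic * DMA_RATIO                           # 34.0 GB, one year of DMA

print(hot_cold_gb + frozen_gb + dma_gb)            # 896.5 GB
```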
Storage calculation: indexer cluster, 500 GB/day
Now let's look at a more complex example. Here we're referring to the 500 GB/day deployment from my previous article, which consists of 8 indexers. For the sake of this example, we'll assume perfect data distribution across all indexers, with a replication factor of 2 and a search factor of 2.
Let's start by calculating the basic storage requirements (a scripted version follows this list):
- 500 GB/day * 0.15 (raw data) = 75 GB/day of disk usage
- 500 GB/day * 0.35 (searchable metadata) = 175 GB/day of disk usage
- 75 GB + 175 GB = 250 GB/day of disk usage (before replication)
- 75 GB * 2 (replication factor) + 175 GB * 2 (search factor) = 500 GB/day after replication
- 500 GB/day / 8 indexers (assuming even distribution) = 62.5 GB/day per indexer
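Here's that per-indexer math as a quick, self-contained sketch (the replication factor, search factor, and indexer count come from this example; the function itself is illustrative):

```python
RAW_RATIO, META_RATIO = 0.15, 0.35

def clustered_daily_gb(license_gb_per_day: float, rep_factor: int, search_factor: int) -> float:
    """Daily cluster-wide disk usage after replication.

    Raw data is kept once per replicated copy (replication factor), while
    searchable metadata is only built for searchable copies (search factor).
    """
    raw = license_gb_per_day * RAW_RATIO * rep_factor
    meta = license_gb_per_day * META_RATIO * search_factor
    return raw + meta

daily_total = clustered_daily_gb(500, rep_factor=2, search_factor=2)
print(daily_total)      # 500.0 GB/day across the cluster
print(daily_total / 8)  # 62.5 GB/day per indexer, assuming even distribution
```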
From here, we can run the calculations based on the desired retention:
90 day retention:
- 500 GB daily usage * 90 days = 45 TB of disk
- 45 TB / 8 indexers = 5.625 TB of disk per indexer
1 year retention:
- 500 GB daily usage * 365 days = 182.5 TB of disk
- 182.5 TB / 8 indexers = ~23 TB of disk per indexer
Data model acceleration (assuming 1 year):
- 500 GB * 3.4 = 1700 GB
1 year retention + DMA:
- 182.5 TB + 1.7 TB = 184.2 TB
- 184.2 TB / 8 indexers = ~23 TB of disk per indexer (just over)
Storage Suggestions:
- In this example, I recommend that customers allocate more than 23TB of space for Splunk data on each indexer.
This gets a bit more complicated if we apply the 90 days hot/cold + 275 days frozen approach, but since we've come this far, why not?
Hot/cold storage:
- 500 GB daily usage * 90 days = 45 TB of disk
- 45 TB / 8 indexers = 5.625 TB of disk per indexer
Frozen storage:
- 75 GB daily usage * 275 days = 20.625 TB of disk
- 20.625 TB / 8 indexers = ~2.578 TB of disk per indexer
Total:
- Full environment: 45 TB (hot/cold) + 20.625 TB (frozen) + 1.7 TB (DMA) = 67.325 TB
- Per indexer: 5.625 TB (hot/cold) + 2.578 TB (frozen) + 0.2125 TB (DMA) = ~8.42 TB
Storage Suggestions:
- In this example, I recommend that customers allocate approximately 9TB of space for Splunk data per indexer.
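For completeness, here's a self-contained sketch reproducing the per-indexer numbers for both retention options above (the frozen portion is left unreplicated to match the example, and the variable names are just for illustration):

```python
RAW_RATIO, META_RATIO, DMA_RATIO = 0.15, 0.35, 3.4
LICENSE_GB = 500   # GB/day license
INDEXERS = 8
RF = SF = 2        # replication factor and search factor

daily_cluster_gb = LICENSE_GB * (RAW_RATIO * RF + META_RATIO * SF)  # 500.0 GB/day
dma_cluster_gb = LICENSE_GB * DMA_RATIO                             # 1700.0 GB

# Option 1: one full year searchable (hot/warm/cold) plus DMA
one_year_tb = (daily_cluster_gb * 365 + dma_cluster_gb) / 1000 / INDEXERS
print(one_year_tb)   # ~23 TB per indexer (just over)

# Option 2: 90 days searchable + 275 days frozen (raw data only) plus DMA
hot_cold_tb = daily_cluster_gb * 90 / 1000        # 45.0 TB
frozen_tb = LICENSE_GB * RAW_RATIO * 275 / 1000   # 20.625 TB
split_tb = (hot_cold_tb + frozen_tb + dma_cluster_gb / 1000) / INDEXERS
print(split_tb)      # ~8.42 TB per indexer
```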
But math is hard!
It's important to understand how storage is calculated and to be able to do it manually if needed (I've worked through this example on many a whiteboard with clients, and walking through it by hand really helps them "get it"). That said, once you're comfortable with the process, there are tools out there that can make it much easier.
The best tool I've found so far is https://splunk-sizing.appspot.com/. It doesn't cover every possible combination (such as certain daily ingest volumes or data model acceleration calculations), but it will give you a quick estimate of how much space your deployment will require.
That said, just like in school, you still need to be able to show your work and defend your calculations if a client has questions.