File size distribution: Difference between revisions


Revision as of 04:58, 23 April 2024


Revision as of 11:50, 29 September 2018 by SqrtNegInf (m: →‎{{header|Perl}}: made output match code). Latest revision as of 04:58, 23 April 2024 by Peak (→‎{{header|jq}}).
(70 intermediate revisions by 19 users not shown)
{{task}}

;Task:
Beginning from the current directory, or optionally from a directory specified as a command-line argument, determine how many files there are of various sizes in a directory hierarchy.

My suggestion is to sort by logarithm of file size, since a few bytes here or there, or even a factor of two or three, may not be that significant.

Don't forget that empty files may exist, to serve as a marker.

Is your file system predominantly devoted to a large number of smaller files, or a smaller number of huge files?
<br><br>

=={{header|Action!}}==
DOS 2.5 returns file size in number of sectors.
{{libheader|Action! Tool Kit}}
<syntaxhighlight lang="action!">INCLUDE "D2:PRINTF.ACT" ;from the Action! Tool Kit

PROC SizeDistribution(CHAR ARRAY filter INT ARRAY limits,counts BYTE count)
  CHAR ARRAY line(255),tmp(4)
  INT size
  BYTE i,dev=[1]

  FOR i=0 TO count-1
  DO
    counts(i)=0
  OD

  Close(dev)
  Open(dev,filter,6)
  DO
    InputSD(dev,line)
    IF line(0)=0 THEN
      EXIT
    FI
    SCopyS(tmp,line,line(0)-3,line(0))
    size=ValI(tmp)
    FOR i=0 TO count-1
    DO
      IF size<limits(i) THEN
        counts(i)==+1
        EXIT
      FI
    OD
  OD
  Close(dev)
RETURN

PROC GenerateLimits(INT ARRAY limits BYTE count)
  BYTE i
  INT l

  l=1
  FOR i=0 TO count-1
  DO
    limits(i)=l
    l==LSH 1
    IF l>1000 THEN l=1000 FI
  OD
RETURN

PROC PrintBar(INT len,max,size)
  INT i,count

  count=4*len*size/max
  IF count=0 AND len>0 THEN
    count=1
  FI
  FOR i=0 TO count/4-1
  DO
    Put(160)
  OD
  i=count MOD 4
  IF i=1 THEN Put(22)
  ELSEIF i=2 THEN Put(25)
  ELSEIF i=3 THEN Put(130) FI
RETURN

PROC PrintResult(CHAR ARRAY filter INT ARRAY limits,counts BYTE count)
  BYTE i
  CHAR ARRAY tmp(5)
  INT min,max,total

  total=0 max=0
  FOR i=0 TO count-1
  DO
    total==+counts(i)
    IF counts(i)>max THEN
      max=counts(i)
    FI
  OD
  PrintF("File size distribution of ""%S"" in sectors:%E",filter)
  PutE()
  PrintE("From To Count Perc")
  min=0
  FOR i=0 TO count-1
  DO
    StrI(min,tmp) PrintF("%4S ",tmp)
    StrI(limits(i)-1,tmp) PrintF("%3S ",tmp)
    StrI(counts(i),tmp) PrintF("%3S ",tmp)
    StrI(counts(i)*100/total,tmp) PrintF("%3S%% ",tmp)
    PrintBar(counts(i),max,17)
    PutE()
    min=limits(i)
  OD
RETURN

PROC Main()
  DEFINE LIMITCOUNT="11"
  CHAR ARRAY filter="H1:*.*"
  INT ARRAY limits(LIMITCOUNT),counts(LIMITCOUNT)

  Put(125) PutE() ;clear the screen
  GenerateLimits(limits,LIMITCOUNT)
  SizeDistribution(filter,limits,counts,LIMITCOUNT)
  PrintResult(filter,limits,counts,LIMITCOUNT)
RETURN</syntaxhighlight>
{{out}}
[https://gitlab.com/amarok8bit/action-rosetta-code/-/raw/master/images/File_size_distribution.png Screenshot from Atari 8-bit computer]
<pre>
File size distribution of "H1:*.*" in sectors:

From To Count Perc
   0   0   2   0% ▌
   1   1  20   3% █▌
   2   3  44   8% ███▌
   4   7 195  37% █████████████████
   8  15 183  35% ███████████████▌
  16  31  67  12% █████▌
  32  63   6   1% ▌
  64 127   0   0%
 128 255   0   0%
 256 511   0   0%
 512 999   1   0% ▌
</pre>

=={{header|Ada}}==
{{libheader|Dir_Iterators}}
<syntaxhighlight lang="ada">with Ada.Numerics.Elementary_Functions;
with Ada.Directories;   use Ada.Directories;
with Ada.Strings.Fixed; use Ada.Strings;
with Ada.Command_Line;  use Ada.Command_Line;
with Ada.Text_IO;       use Ada.Text_IO;

with Dir_Iterators.Recursive;

procedure File_Size_Distribution is
   type Exponent_Type is range 0 .. 18;
   type File_Count is range 0 .. Long_Integer'Last;
   Counts : array (Exponent_Type) of File_Count := (others => 0);
   Non_Zero_Index : Exponent_Type := 0;
   Directory_Name : constant String := (if Argument_Count = 0 then "."
else Argument (1)); Directory_Walker : Dir_Iterators.Recursive.Recursive_Dir_Walk := Dir_Iterators.Recursive.Walk (Directory_Name); begin if not Exists (Directory_Name) or else Kind (Directory_Name) /= Directory then Put_Line ("Directory does not exist"); return; end if; for Directory_Entry of Directory_Walker loop declare use Ada.Numerics.Elementary_Functions; Size_Of_File : File_Size; Exponent : Exponent_Type; begin if Kind (Directory_Entry) = Ordinary_File then Size_Of_File := Size (Directory_Entry); if Size_Of_File = 0 then Counts (0) := Counts (0) + 1; else Exponent := Exponent_Type (Float'Ceiling (Log (Float (Size_Of_File), Base => 10.0))); Counts (Exponent) := Counts (Exponent) + 1; end if; end if; end; end loop; for I in reverse Counts'Range loop if Counts (I) /= 0 then Non_Zero_Index := I; exit; end if; end loop; for I in Counts'First .. Non_Zero_Index loop Put ("Less than 10"); Put (Fixed.Trim (Exponent_Type'Image (I), Side => Left)); Put (": "); Put (File_Count'Image (Counts (I))); New_Line; end loop; end File_Size_Distribution;</syntaxhighlight> {{out}} <pre>Less than 100: 8 Less than 101: 0 Less than 102: 18 Less than 103: 88 Less than 104: 39 Less than 105: 8 Less than 106: 2 Less than 107: 1</pre> =={{header\|C}}== The platform independent way to get the file size in C involves opening every file and reading the size. The implementation below works for Windows and utilizes command scripts to get size information quickly even for a large number of files, recursively traversing a large number of directories. Both textual and graphical ( ASCII ) outputs are shown. The same can be done for Linux by a combination of the find, ls and stat commands and my plan was to make it work on both OS types, but I don't have access to a Linux system right now. This would also mean either abandoning scaling the graphical output in order to fit the console buffer or porting that as well, thus including windows.h selectively. 
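The paragraph above mentions the platform-independent route of opening every file and reading its size. A minimal sketch of that idea in standard C is given here. The names <code>file_size_portable</code> and <code>size_order</code> are hypothetical and do not appear in the entries below. The sketch assumes that seeking to the end of a binary stream is supported, which holds on common platforms but is not strictly guaranteed by the C standard, and it is limited to sizes that fit in a <code>long</code>.

```c
#include <stdio.h>

/* Hypothetical helper: size of a file in bytes using only standard C.
   Opens the file in binary mode, seeks to the end and reads the offset.
   Returns -1 if the file cannot be opened or the seek fails. */
long file_size_portable(const char *path)
{
    FILE *fp = fopen(path, "rb");
    long size;

    if (fp == NULL)
        return -1L;
    if (fseek(fp, 0L, SEEK_END) != 0) {
        fclose(fp);
        return -1L;
    }
    size = ftell(fp);
    fclose(fp);
    return size;
}

/* Hypothetical helper: decimal order of magnitude of a size, i.e. the
   number of decimal digits, with 0 reserved for empty files.
   A file of 999 bytes is in bucket 3, one of 1000 bytes in bucket 4. */
int size_order(long size)
{
    int order = 0;

    while (size > 0) {
        size /= 10;
        ++order;
    }
    return order;
}
```

Calling <code>size_order(file_size_portable(path))</code> for every regular file and incrementing a counter indexed by the result would give a decimal histogram of the same kind as the entries below. The cost is one open and close per file, which is why the entries in this section query the size from the operating system instead.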
===Windows===
<syntaxhighlight lang="c">
#include<windows.h>
#include<string.h>
Line 30 ⟶ 224:
    double scale;
    FILE* fp;

    if(argC==1)
        printf("Usage : %s <followed by directory to start search from(. for current dir), followed by \n optional parameters (T or G) to show text or graph output>",argV[0]);
Line 43 ⟶ 237:
            sprintf(commandString,"forfiles /p %s /s /c \"cmd /c echo @fsize\" 2>&1",startPath);
        }
        else if(strlen(argV[1])==1 && argV[1][0]=='.')
            strcpy(commandString,"forfiles /s /c \"cmd /c echo @fsize\" 2>&1");
        else
            sprintf(commandString,"forfiles /p %s /s /c \"cmd /c echo @fsize\" 2>&1",argV[1]);
Line 58 ⟶ 252:
            fileSizeLog[strlen(str)]++;
        }
        if(argC==2 || (argC==3 && (argV[2][0]=='t'||argV[2][0]=='T'))){
            for(i=0;i<MAXORDER;i++){
Line 64 ⟶ 258:
            }
        }
        else if(argC==3 && (argV[2][0]=='g'||argV[2][0]=='G')){
            CONSOLE_SCREEN_BUFFER_INFO csbi;
Line 72 ⟶ 266:
            max = fileSizeLog[0];
            for(i=1;i<MAXORDER;i++)
                (fileSizeLog[i]>max)?max=fileSizeLog[i]:max;
            (max < csbi.dwSize.X)?(scale=1):(scale=(1.0*(csbi.dwSize.X-50))/max);
            for(i=0;i<MAXORDER;i++){
                printf("\nSize Order < 10^%2d bytes |",i);
Line 85 ⟶ 279:
                }
            }
        }
        return 0;
    }
}
</syntaxhighlight>
Invocation and textual output :
<pre>
Line 152 ⟶ 346:
</pre>
Note that it is possible to track files up to 10^24 (Yottabyte) in size with this implementation, but if you have a file that large, you shouldn't be needing such programs. :)

===POSIX===
{{libheader|POSIX}}
This works on macOS 10.15. It should be OK for Linux as well.
<syntaxhighlight lang="c">#include <ftw.h>
#include <locale.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static const uintmax_t sizes[] = {
    0, 1000, 10000, 100000, 1000000, 10000000,
    100000000, 1000000000, 10000000000
};
static const size_t nsizes = sizeof(sizes)/sizeof(sizes[0]);
static uintmax_t count[nsizes + 1] = { 0 };
static uintmax_t files = 0;
static uintmax_t total_size = 0;

static int callback(const char* file, const struct stat* sp, int flag) {
    if (flag == FTW_F) {
        uintmax_t file_size = sp->st_size;
        ++files;
        total_size += file_size;
        size_t index = 0;
        for (; index < nsizes && sizes[index] < file_size; ++index);
        ++count[index];
    } else if (flag == FTW_DNR) {
        fprintf(stderr, "Cannot read directory %s.\n", file);
    }
    return 0;
}

int main(int argc, char** argv) {
    setlocale(LC_ALL, "");
    const char* directory = argc > 1 ? argv[1] : ".";
    if (ftw(directory, callback, 512) != 0) {
        perror(directory);
        return EXIT_FAILURE;
    }
    printf("File size distribution for '%s':\n", directory);
    for (size_t i = 0; i <= nsizes; ++i) {
        if (i == nsizes)
            printf("> %'lu", sizes[i - 1]);
        else
            printf("%'16lu", sizes[i]);
        printf(" bytes: %'lu\n", count[i]);
    }
    printf("Number of files: %'lu\n", files);
    printf("Total file size: %'lu\n", total_size);
    return EXIT_SUCCESS;
}</syntaxhighlight>

{{out}}
<pre>
File size distribution for '.':
               0 bytes: 0
           1,000 bytes: 3
          10,000 bytes: 111
         100,000 bytes: 2,457
       1,000,000 bytes: 2,645
      10,000,000 bytes: 2,483
     100,000,000 bytes: 172
   1,000,000,000 bytes: 3
  10,000,000,000 bytes: 0
> 10,000,000,000 bytes: 0
Number of files: 7,874
Total file size: 11,963,566,673
</pre>

=={{header|C++}}==
<syntaxhighlight lang="cpp">#include <algorithm>
#include <array>
#include <filesystem>
#include <iomanip>
#include <iostream>

void file_size_distribution(const std::filesystem::path& directory) {
    constexpr size_t n = 9;
    constexpr std::array<std::uintmax_t, n> sizes = { 0, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000, 10000000000 };
std::array<size_t, n + 1> count = { 0 }; size_t files = 0; std::uintmax_t total_size = 0; std::filesystem::recursive_directory_iterator iter(directory); for (const auto& dir_entry : iter) { if (dir_entry.is_regular_file() && !dir_entry.is_symlink()) { std::uintmax_t file_size = dir_entry.file_size(); total_size += file_size; auto i = std::lower_bound(sizes.begin(), sizes.end(), file_size); size_t index = std::distance(sizes.begin(), i); ++count[index]; ++files; } } std::cout << "File size distribution for " << directory << ":\n"; for (size_t i = 0; i <= n; ++i) { if (i == n) std::cout << "> " << sizes[i - 1]; else std::cout << std::setw(16) << sizes[i]; std::cout << " bytes: " << count[i] << '\n'; } std::cout << "Number of files: " << files << '\n'; std::cout << "Total file size: " << total_size << " bytes\n"; } int main(int argc, char** argv) { std::cout.imbue(std::locale("")); try { const char* directory(argc > 1 ? argv[1] : "."); std::filesystem::path path(directory); if (!is_directory(path)) { std::cerr << directory << " is not a directory.\n"; return EXIT_FAILURE; } file_size_distribution(path); } catch (const std::exception& ex) { std::cerr << ex.what() << '\n'; return EXIT_FAILURE; } return EXIT_SUCCESS; }</syntaxhighlight> {{out}} <pre> File size distribution for ".": 0 bytes: 0 1,000 bytes: 3 10,000 bytes: 111 100,000 bytes: 2,457 1,000,000 bytes: 2,645 10,000,000 bytes: 2,483 100,000,000 bytes: 172 1,000,000,000 bytes: 3 10,000,000,000 bytes: 0 > 10,000,000,000 bytes: 0 Number of files: 7,874 Total file size: 11,963,566,673 bytes </pre> =={{header\|Delphi}}== {{libheader\| System.SysUtils}} {{libheader\| System.Math}} {{libheader\| Winapi.Windows}} {{Trans\|Go}} <syntaxhighlight lang="delphi"> program File_size_distribution; {$APPTYPE CONSOLE} uses System.SysUtils, System.Math, Winapi.Windows; function Commatize(n: Int64): string; begin result := n.ToString; if n < 0 then delete(result, 1, 1); var le := result.Length; var i := le - 3; while i >= 1 do 
  begin
    Insert(',', result, i + 1);
    dec(i, 3);
  end;

  if n >= 0 then
    exit;
  Result := '-' + result;
end;

procedure Walk(Root: string; walkFunc: TProc<string, TWin32FindData>); overload;
var
  rec: TWin32FindData;
  h: THandle;
  directory, PatternName: string;
begin
  if not Assigned(walkFunc) then
    exit;

  Root := IncludeTrailingPathDelimiter(Root);

  h := FindFirstFile(Pchar(Root + '*.*'), rec);
  if (INVALID_HANDLE_VALUE <> h) then
    repeat
      if rec.cFileName[0] = '.' then
        Continue;
      walkFunc(directory, rec);
      if ((rec.dwFileAttributes and FILE_ATTRIBUTE_DIRECTORY) = FILE_ATTRIBUTE_DIRECTORY)
        and (rec.cFileName[0] <> '.') then
        Walk(Root + rec.cFileName, walkFunc);
    until not FindNextFile(h, rec);
  FindClose(h);
end;

procedure FileSizeDistribution(root: string);
var
  sizes: TArray<Integer>;
  files, directories, totalSize, size, i: UInt64;
  c: string;
begin
  SetLength(sizes, 12);
  files := 0;
  directories := 0;
  totalSize := 0;
  size := 0;

  Walk(root,
    procedure(path: string; info: TWin32FindData)
    var
      logSize: Extended;
      index: integer;
    begin
      inc(files);
      if (info.dwFileAttributes and FILE_ATTRIBUTE_DIRECTORY) = FILE_ATTRIBUTE_DIRECTORY then
        inc(directories);
      size := info.nFileSizeHigh shl 32 + info.nFileSizeLow;
      if size = 0 then
      begin
        sizes[0] := sizes[0] + 1;
        exit;
      end;
      inc(totalSize, size);
      logSize := Log10(size);
      index := Floor(logSize);
      sizes[index] := sizes[index] + 1;
    end);

  writeln('File size distribution for "', root, '" :-'#10);
  for i := 0 to High(sizes) do
  begin
    if i = 0 then
      write(' ')
    else
      write('+ ');
    writeln(format('Files less than 10 ^ %-2d bytes : %5d', [i, sizes[i]]));
  end;
  writeln(' -----');
  writeln('= Total number of files : ', files: 5);
  writeln(' including directories : ', directories: 5);
  c := commatize(totalSize);
  writeln(#10' Total size of files : ', c, 'bytes');
end;

begin
  fileSizeDistribution('.');
  readln;
end.</syntaxhighlight>

=={{header|Factor}}==
{{works with|Factor|0.99 2020-03-02}}
<syntaxhighlight lang="factor">USING: accessors assocs formatting io io.directories.search
io.files.types io.pathnames kernel math math.functions math.statistics namespaces sequences ; : classify ( m -- n ) [ 0 ] [ log10 >integer 1 + ] if-zero ; : file-size-histogram ( path -- assoc ) recursive-directory-entries [ type>> +directory+ = ] reject [ size>> classify ] map histogram ; current-directory get file-size-histogram dup [ "Count of files < 10^%d bytes: %4d\n" printf ] assoc-each nl values sum "Total files: %d\n" printf</syntaxhighlight> {{out}} <pre> Count of files < 10^0 bytes: 20 Count of files < 10^1 bytes: 742 Count of files < 10^2 bytes: 3881 Count of files < 10^3 bytes: 2388 Count of files < 10^4 bytes: 3061 Count of files < 10^5 bytes: 486 Count of files < 10^6 bytes: 78 Count of files < 10^7 bytes: 27 Count of files < 10^8 bytes: 3 Count of files < 10^9 bytes: 1 Total files: 10687 </pre> =={{header\|Go}}== {{trans\|Kotlin}} <syntaxhighlight lang="go">package main import ( "fmt" "log" "math" "os" "path/filepath" ) func commatize(n int64) string { s := fmt.Sprintf("%d", n) if n < 0 { s = s[1:] } le := len(s) for i := le - 3; i >= 1; i -= 3 { s = s[0:i] + "," + s[i:] } if n >= 0 { return s } return "-" + s } func fileSizeDistribution(root string) { var sizes [12]int files := 0 directories := 0 totalSize := int64(0) walkFunc := func(path string, info os.FileInfo, err error) error { if err != nil { return err } files++ if info.IsDir() { directories++ } size := info.Size() if size == 0 { sizes[0]++ return nil } totalSize += size logSize := math.Log10(float64(size)) index := int(math.Floor(logSize)) sizes[index+1]++ return nil } err := filepath.Walk(root, walkFunc) if err != nil { log.Fatal(err) } fmt.Printf("File size distribution for '%s' :-\n\n", root) for i := 0; i < len(sizes); i++ { if i == 0 { fmt.Print(" ") } else { fmt.Print("+ ") } fmt.Printf("Files less than 10 ^ %-2d bytes : %5d\n", i, sizes[i]) } fmt.Println(" -----") fmt.Printf("= Total number of files : %5d\n", files) fmt.Printf(" including directories : %5d\n", directories) c := 
commatize(totalSize) fmt.Println("\n Total size of files :", c, "bytes") } func main() { fileSizeDistribution("./") }</syntaxhighlight> {{out}} <pre> File size distribution for './' :- Files less than 10 ^ 0 bytes : 0 + Files less than 10 ^ 1 bytes : 0 + Files less than 10 ^ 2 bytes : 8 + Files less than 10 ^ 3 bytes : 98 + Files less than 10 ^ 4 bytes : 163 + Files less than 10 ^ 5 bytes : 18 + Files less than 10 ^ 6 bytes : 8 + Files less than 10 ^ 7 bytes : 18 + Files less than 10 ^ 8 bytes : 1 + Files less than 10 ^ 9 bytes : 0 + Files less than 10 ^ 10 bytes : 0 + Files less than 10 ^ 11 bytes : 0 ----- = Total number of files : 314 including directories : 7 Total size of files : 74,205,408 bytes </pre> =={{header\|Haskell}}== <p> Uses a grouped frequency distribution. Program arguments are optional. Arguments include starting directory and initial frequency distribution group size. After the first frequency distribution is computed it further breaks it down for any group that exceeds 25% of the total file count, when possible. 
</p> <syntaxhighlight lang="haskell">{-# LANGUAGE LambdaCase #-} import Control.Concurrent (forkIO, setNumCapabilities) import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan, writeList2Chan) import Control.Exception (IOException, catch) import Control.Monad (filterM, forever, join, replicateM, replicateM_, (>=>)) import Control.Parallel.Strategies (parTraversable, rseq, using, withStrategy) import Data.Char (isDigit) import Data.List (find, sort) import qualified Data.Map.Strict as Map import GHC.Conc (getNumProcessors) import System.Directory (doesDirectoryExist, doesFileExist, listDirectory, pathIsSymbolicLink) import System.Environment (getArgs) import System.FilePath.Posix ((</>)) import System.IO (FilePath, IOMode (ReadMode), hFileSize, hPutStrLn, stderr, withFile) import Text.Printf (hPrintf, printf) data Item = File FilePath Integer \| Folder FilePath deriving (Show) type FGKey = (Integer, Integer) type FrequencyGroup = (FGKey, Integer) type FrequencyGroups = Map.Map FGKey Integer newFrequencyGroups :: FrequencyGroups newFrequencyGroups = Map.empty fileSizes :: [Item] -> [Integer] fileSizes = foldr f [] where f (File _ n) acc = n:acc f _ acc = acc folders :: [Item] -> [FilePath] folders = foldr f [] where f (Folder p) acc = p:acc f _ acc = acc totalBytes :: [Item] -> Integer totalBytes = sum . fileSizes counts :: [Item] -> (Integer, Integer) counts = foldr (\x (a, b) -> case x of File _ _ -> (succ a, b) Folder _ -> (a, succ b)) (0, 0) -- \|Creates 'FrequencyGroups' from the provided size and data set. frequencyGroups :: Int -- ^ Desired number of frequency groups. -> [Integer] -- ^ List of collected file sizes. Must be sorted. -> FrequencyGroups -- ^ Returns a 'FrequencyGroups' for the file sizes. 
frequencyGroups _ [] = newFrequencyGroups
frequencyGroups totalGroups xs
  | length xs == 1 = Map.singleton (head xs, head xs) 1
  | otherwise = foldr placeGroups newFrequencyGroups xs `using` parTraversable rseq
  where
    range = maximum xs - minimum xs
    groupSize = succ $ ceiling $ realToFrac range / realToFrac totalGroups
    groups = takeWhile (<=groupSize + maximum xs) $ iterate (+groupSize) 0
    groupMinMax = zip groups (pred <$> tail groups)
    findGroup n = find (\(low, high) -> n >= low && n <= high)

    incrementCount (Just n) = Just (succ n) -- Update count for range.
    incrementCount Nothing  = Just 1        -- Insert new range with initial count.

    placeGroups n fgMap = case findGroup n groupMinMax of
      Just k  -> Map.alter incrementCount k fgMap
      Nothing -> fgMap -- Should never happen.

expandGroups :: Int             -- ^ Desired number of frequency groups.
             -> [Integer]       -- ^ List of collected file sizes.
             -> Integer         -- ^ Computed frequency group limit.
             -> FrequencyGroups -- ^ Expanded 'FrequencyGroups'
expandGroups gsize fileSizes groupThreshold
  | groupThreshold > 0 = loop 15 $ frequencyGroups gsize sortedFileSizes
  | otherwise = frequencyGroups gsize sortedFileSizes
  where
    sortedFileSizes = sort fileSizes
    loop 0 gs = gs -- break out in case we can't go below threshold
    loop n gs
      | all (<= groupThreshold) $ Map.elems gs = gs
      | otherwise = loop (pred n) (expand gs)

    expand :: FrequencyGroups -> FrequencyGroups
    expand = foldr f . withStrategy (parTraversable rseq)
      <*> Map.mapWithKey groupsFromGroup . Map.filter (> groupThreshold)
      where
        f :: Maybe (FGKey, FrequencyGroups) -- ^ expanded frequency group
          -> FrequencyGroups                -- ^ accumulator
          -> FrequencyGroups                -- ^ merged accumulator
        f (Just (k, fg)) acc = Map.union (Map.delete k acc) fg
        f Nothing acc        = acc

        groupsFromGroup
          :: FGKey   -- ^ Group Key
          -> Integer -- ^ Count
          -> Maybe (FGKey, FrequencyGroups) -- ^ Returns expanded 'FrequencyGroups' with base key it replaces.
        groupsFromGroup (min, max) count
          | length range > 1 = Just ((min, max), frequencyGroups gsize range)
          | otherwise = Nothing
          where
            range = filter (\n -> n >= min && n <= max) sortedFileSizes

displaySize :: Integer -> String
displaySize n
  | n <= 2^10 = printf "%8dB " n
  | n >= 2^10 && n <= 2^20 = display (2^10) "KB"
  | n >= 2^20 && n <= 2^30 = display (2^20) "MB"
  | n >= 2^30 && n <= 2^40 = display (2^30) "GB"
  | n >= 2^40 && n <= 2^50 = display (2^40) "TB"
  | otherwise = "Too large!"
  where
    display :: Double -> String -> String
    display b = printf "%7.2f%s " (realToFrac n / b)

displayFrequency :: Integer -> FrequencyGroup -> IO ()
displayFrequency filesCount ((min, max), count) = do
  printf "%s <-> %s" (displaySize min) (displaySize max)
  printf "= %-10d %6.3f%%: %-5s\n" count percentage bars
  where
    percentage :: Double
    percentage = (realToFrac count / realToFrac filesCount) * 100
    size = round percentage
    bars
      | size == 0 = "▍"
      | otherwise = replicate size '█'

folderWorker :: Chan FilePath -> Chan [Item] -> IO ()
folderWorker folderChan resultItemsChan =
  forever (readChan folderChan >>= collectItems >>= writeChan resultItemsChan)

collectItems :: FilePath -> IO [Item]
collectItems folderPath = catch tryCollect $ \e -> do
    hPrintf stderr "Skipping: %s\n" $ show (e :: IOException)
    pure []
  where
    tryCollect = (fmap (folderPath </>) <$> listDirectory folderPath) >>=
      mapM (\p -> doesDirectoryExist p >>=
        \case
          True  -> pure $ Folder p
          False -> File p <$> withFile p ReadMode hFileSize)

parallelItemCollector :: FilePath -> IO [Item]
parallelItemCollector folder = do
  wCount <- getNumProcessors
  setNumCapabilities wCount
  printf "Using %d worker threads\n" wCount
  folderChan <- newChan
  resultItemsChan <- newChan
  replicateM_ wCount (forkIO $ folderWorker folderChan resultItemsChan)
  loop folderChan resultItemsChan [Folder folder]
  where
    loop :: Chan FilePath -> Chan [Item] -> [Item] -> IO [Item]
    loop folderChan resultItemsChan xs = do
      regularFolders <- filterM (pathIsSymbolicLink >=> (pure
        . not)) $ folders xs
      if null regularFolders then pure []
        else do
          writeList2Chan folderChan regularFolders
          childItems <- replicateM (length regularFolders) (readChan resultItemsChan)
          result <- mapM (loop folderChan resultItemsChan) childItems
          pure (join childItems <> join result)

parseArgs :: [String] -> Either String (FilePath, Int)
parseArgs (x:y:xs)
  | all isDigit y = Right (x, read y)
  | otherwise     = Left "Invalid frequency group size"
parseArgs (x:xs) = Right (x, 4)
parseArgs _ = Right (".", 4)

main :: IO ()
main = parseArgs <$> getArgs >>= \case
  Left errorMessage -> hPutStrLn stderr errorMessage
  Right (path, groupSize) -> do
    items <- parallelItemCollector path
    let (fileCount, folderCount) = counts items
    printf "Total files: %d\nTotal folders: %d\n" fileCount folderCount
    printf "Total size: %s\n" $ displaySize $ totalBytes items
    printf "\nDistribution:\n\n%9s <-> %9s %7s\n" "From" "To" "Count"
    putStrLn $ replicate 46 '-'
    let results = expandGroups groupSize (fileSizes items) (groupThreshold fileCount)
    mapM_ (displayFrequency fileCount) $ Map.assocs results
  where
    groupThreshold = round . (*0.25) .
realToFrac</syntaxhighlight> {{out}} <pre style="height: 50rem;">$ filedist ~/Music Using 4 worker threads Total files: 688 Total folders: 663 Total size: 985.85MB Distribution: From <-> To Count ---------------------------------------------- 0B <-> 80B = 7 1.017%: █ 81B <-> 161B = 74 10.756%: ███████████ 162B <-> 242B = 112 16.279%: ████████████████ 243B <-> 323B = 99 14.390%: ██████████████ 323B <-> 645B = 23 3.343%: ███ 646B <-> 968B = 2 0.291%: ▍ 969B <-> 1.26KB = 1 0.145%: ▍ 3.19KB <-> 6.38KB = 12 1.744%: ██ 6.38KB <-> 9.58KB = 22 3.198%: ███ 9.58KB <-> 12.77KB = 12 1.744%: ██ 13.52KB <-> 27.04KB = 15 2.180%: ██ 27.04KB <-> 40.57KB = 6 0.872%: █ 40.57KB <-> 54.09KB = 22 3.198%: ███ 54.20KB <-> 108.41KB = 99 14.390%: ██████████████ 108.41KB <-> 162.61KB = 23 3.343%: ███ 162.61KB <-> 216.81KB = 8 1.163%: █ 236.46KB <-> 472.93KB = 3 0.436%: ▍ 709.39KB <-> 945.85KB = 44 6.395%: ██████ 3.30MB <-> 4.96MB = 4 0.581%: █ 4.96MB <-> 6.61MB = 21 3.052%: ███ 6.67MB <-> 13.33MB = 72 10.465%: ██████████ 13.33MB <-> 20.00MB = 6 0.872%: █ 20.00MB <-> 26.66MB = 1 0.145%: ▍ $ filedist ~/Music 10 Using 4 worker threads Total files: 688 Total folders: 663 Total size: 985.85MB Distribution: From <-> To Count ---------------------------------------------- 0B <-> 88B = 7 1.017%: █ 89B <-> 177B = 75 10.901%: ███████████ 178B <-> 266B = 156 22.674%: ███████████████████████ 267B <-> 355B = 57 8.285%: ████████ 356B <-> 444B = 20 2.907%: ███ 801B <-> 889B = 2 0.291%: ▍ 959B <-> 1.87KB = 1 0.145%: ▍ 3.75KB <-> 4.68KB = 1 0.145%: ▍ 4.68KB <-> 5.62KB = 1 0.145%: ▍ 5.62KB <-> 6.55KB = 11 1.599%: ██ 6.56KB <-> 7.49KB = 10 1.453%: █ 7.49KB <-> 8.43KB = 4 0.581%: █ 8.43KB <-> 9.36KB = 7 1.017%: █ 9.43KB <-> 18.85KB = 21 3.052%: ███ 18.85KB <-> 28.28KB = 6 0.872%: █ 28.28KB <-> 37.71KB = 4 0.581%: █ 37.71KB <-> 47.13KB = 12 1.744%: ██ 47.13KB <-> 56.56KB = 16 2.326%: ██ 56.56KB <-> 65.99KB = 23 3.343%: ███ 65.99KB <-> 75.41KB = 26 3.779%: ████ 75.41KB <-> 84.84KB = 15 2.180%: ██ 84.84KB <-> 
94.27KB = 17 2.471%: ██ 94.59KB <-> 189.17KB = 42 6.105%: ██████ 189.17KB <-> 283.76KB = 4 0.581%: █ 283.76KB <-> 378.35KB = 2 0.291%: ▍ 851.28KB <-> 945.87KB = 44 6.395%: ██████ 2.67MB <-> 5.33MB = 5 0.727%: █ 5.33MB <-> 8.00MB = 41 5.959%: ██████ 8.00MB <-> 10.67MB = 35 5.087%: █████ 10.67MB <-> 13.33MB = 16 2.326%: ██ 13.33MB <-> 16.00MB = 3 0.436%: ▍ 16.00MB <-> 18.67MB = 3 0.436%: ▍ 24.00MB <-> 26.66MB = 1 0.145%: ▍ </pre> =={{header\|J}}== We can get file sizes of all files under a specific path by inspecting the last column from dirtree. For example, the sizes of the files under the user's home directory would be <tt>;{:\|:dirtree '~'</tt> From there, we can bucket them by factors of ten, then display the limiting size of each bucket along with the number of files contained (we'll sort them, for legibility): <syntaxhighlight lang="j"> ((10x^~.),.#/.~) <.10 ^.1>. /:~;{:\|:dirtree '~' 1 2 10 8 100 37 1000 49 10000 20 100000 9 1000000 4 10000000 4</syntaxhighlight> =={{header\|Java}}== <syntaxhighlight lang="java"> import java.io.IOException; import java.nio.file.Files; import java.nio.file.Path; import java.util.HashMap; import java.util.List; import java.util.Map; public final class FileSizeDistribution { public static void main(String[] aArgs) throws IOException { List<Path> fileNames = Files.list(Path.of(".")) .filter( file -> ! 
                    Files.isDirectory(file) )
            .map(Path::getFileName)
            .toList();

        Map<Integer, Integer> fileSizes = new HashMap<Integer, Integer>();
        for ( Path path : fileNames ) {
            fileSizes.merge(String.valueOf(Files.size(path)).length(), 1, Integer::sum);
        }

        final int fileCount = fileSizes.values().stream().mapToInt(Integer::valueOf).sum();

        System.out.println("File size distribution for directory \".\":" + System.lineSeparator());
        System.out.println("File size in bytes | Number of files | Percentage");
        System.out.println("-------------------------------------------------");
        for ( int key : fileSizes.keySet() ) {
            final int value = fileSizes.get(key);
            System.out.println(String.format("%s%d%s%d%15d%15.1f%%",
                " 10^", ( key - 1 ), " to 10^", key, value, ( 100.0 * value ) / fileCount));
        }
    }

}
</syntaxhighlight>
{{ out }}
<pre>
File size distribution for directory ".":

File size in bytes | Number of files | Percentage
-------------------------------------------------
 10^0 to 10^1              1            0.2%
 10^1 to 10^2              1            0.2%
 10^2 to 10^3              5            1.1%
 10^3 to 10^4              3            0.6%
 10^4 to 10^5            161           34.0%
 10^5 to 10^6            196           41.4%
 10^6 to 10^7             98           20.7%
 10^7 to 10^8              9            1.9%
</pre>

=={{header|jq}}==
'''Works with jq, the C implementation of jq'''

'''Works with gojq, the Go implementation of jq'''

'''Works with jaq, the Rust implementation of jq'''

This entry illustrates how jq plays nicely with other command-line tools; in this case jc (https://kellyjonbrazil.github.io/jc) is used to JSONify the output of `ls -Rl`.

(jq could also be used to parse the raw output of `ls`, but it would no doubt be tricky to achieve portability.)

The invocation of jc and jq would be along the following lines:
<pre>
jc --ls -lR | jq -c -f file-size-distribution.jq
</pre>

In the present case, the output from the call to `histogram` is a stream of [category, count] pairs beginning with [0, _] showing the number of files of size 0; thereafter, the boundaries of the categories are defined logarithmically, i.e.
a file of size $n is assigned to the category `1 + ($n | log10 | trunc)`.

The output shown below for an actual directory tree suggests a unimodal distribution of file sizes.
<syntaxhighlight lang="jq">
# bag of words
def bow(stream):
  reduce stream as $word ({}; .[($word|tostring)] += 1);

# `stream` is expected to be a stream of non-negative numbers or numeric strings.
# The output is a stream of [bucket, count] pairs, sorted by the value of `bucket`.
# No sorting except for the sorting of these bucket boundaries takes place.
def histogram(stream):
  bow(stream)
  | to_entries
  | map( [(.key | tonumber), .value] )
  | sort_by(.[0])
  | .[];

histogram(.[] | .size | if . == 0 then 0 else 1 + (log10 | trunc) end)
</syntaxhighlight>
{{output}}
<pre>
[0,9]
[1,67]
[2,616]
[3,6239]
[4,3679]
[5,213]
[6,56]
[7,40]
[8,20]
[9,4]
[10,1]
</pre>

=={{header|Julia}}==
{{works with|Julia|0.6}}
<syntaxhighlight lang="julia">using Humanize

function sizelist(path::AbstractString)
Line 182 ⟶ 1,150:
end

main(".")</syntaxhighlight>

{{out}}
<pre>filesizes:
 - between 0.0 B and 1.0 B bytes: 0
 - between 1.0 B and 10.0 B bytes: 1
Line 200 ⟶ 1,168:

=={{header|Kotlin}}==
<syntaxhighlight lang="scala">// version 1.2.10

import java.io.File
Line 247 ⟶ 1,215:
fun main(args: Array<String>) {
    fileSizeDistribution("./") // current directory
}</syntaxhighlight>

{{out}}
Line 272 ⟶ 1,240:
Number of inaccessible files : 0
</pre>

=={{header|Lang}}==
{{libheader|lang-io-module}}
<syntaxhighlight lang="lang">
# Load the IO module
# Replace "<pathToIO.lm>" with the location where the io.lm Lang module was installed to without "<" and ">"
ln.loadModule(<pathToIO.lm>)

fp.fileSizeDistribution = (&sizes, $[totalSize], $file) -> {
	if([[io]]::fp.isDirectory($file)) {
		&fileNames = [[io]]::fp.listFilesAndDirectories($file)

		$path = [[io]]::fp.getCanonicalPath($file)
		if($path == /) {
			$path = \e
		}

		$fileName
		foreach($[fileName], &fileNames) {
			$innerFile =
[[io]]::fp.openFile($path/$fileName) $innerTotalSize = 0L fp.fileSizeDistribution(&sizes, $innerTotalSize, $innerFile) $totalSize += $innerTotalSize [[io]]::fp.closeFile($innerFile) } }else { $len = [[io]]::fp.getSize($file) if($len == null) { return } $totalSize += $len if($len == 0) { &sizes[0] += 1 }else { $index = fn.int(fn.log10($len)) &sizes[$index] += 1 } } } $path $= @&LANG_ARGS == 1?&LANG_ARGS[0]:{{{./}}} &sizes = fn.arrayMake(12) fn.arraySetAll(&sizes, 0) $file = [[io]]::fp.openFile($path) $totalSize = 0L fp.fileSizeDistribution(&sizes, $totalSize, $file) [[io]]::fp.closeFile($file) fn.println(File size distribution for "$path":) $i repeat($[i], @&sizes) { fn.printf(10 ^% 3d bytes: %d%n, $i, parser.op(&sizes[$i])) } fn.println(Number of files: fn.arrayReduce(&sizes, 0, fn.add)) fn.println(Total file size: $totalSize) </syntaxhighlight> =={{header\|Mathematica}} / {{header\|Wolfram Language}}== <syntaxhighlight lang="mathematica">SetDirectory[NotebookDirectory[]]; Histogram[FileByteCount /@ Select[FileNames[__], DirectoryQ /* Not], {"Log", 15}, {"Log", "Count"}]</syntaxhighlight> =={{header\|Nim}}== <syntaxhighlight lang="nim">import math, os, strformat const MaxPower = 10 Powers = [1, 10, 100] func powerWithUnit(idx: int): string = ## Return a string representing value 10^idx with a unit. if idx < 0: "0B" elif idx < 3: fmt"{Powers[idx]}B" elif idx < 6: fmt"{Powers[idx - 3]}kB" elif idx < 9: fmt"{Powers[idx - 6]}MB" else: fmt"{Powers[idx - 9]}GB" # Retrieve the directory path. var dirpath: string if paramCount() == 0: dirpath = getCurrentDir() else: dirpath = paramStr(1) if not dirExists(dirpath): raise newException(ValueError, "wrong directory path: " & dirpath) # Distribute sizes. var counts: array[-1..MaxPower, Natural] for path in dirpath.walkDirRec(): if not path.fileExists(): continue # Not a regular file. let size = getFileSize(path) let index = if size == 0: -1 else: log10(size.float).toInt inc counts[index] # Display distribution. 
let total = sum(counts)
echo "File size distribution for directory: ", dirpath
echo ""
for idx, count in counts:
  let rangeString = fmt"[{powerWithUnit(idx)}..{powerWithUnit(idx + 1)}[:"
  echo fmt"Size in {rangeString: 14} {count:>7} {100 * count / total:5.2f}%"
echo ""
echo "Total number of files: ", sum(counts)</syntaxhighlight>

{{out}}
<pre>File size distribution for directory: /home/xxx

Size in [0B..1B[: 2782 1.28%
Size in [1B..10B[: 145 0.07%
Size in [10B..100B[: 2828 1.30%
Size in [100B..1kB[: 20781 9.55%
Size in [1kB..10kB[: 85469 39.29%
Size in [10kB..100kB[: 86594 39.81%
Size in [100kB..1MB[: 16629 7.64%
Size in [1MB..10MB[: 2053 0.94%
Size in [10MB..100MB[: 221 0.10%
Size in [100MB..1GB[: 38 0.02%
Size in [1GB..10GB[: 0 0.00%
Size in [10GB..100GB[: 0 0.00%

Total number of files: 217540</pre>

=={{header|Perl}}==
{{trans|Raku}}
<syntaxhighlight lang="perl">use File::Find;
use List::Util qw(max);
Line 302 ⟶ 1,406:
sub fsize { $fsize{ log10( (lstat($_))[7] ) }++ }
sub log10 { my($s) = @_; $s ? int log($s)/log(10) : 0 }</syntaxhighlight>
{{out}}
<pre>File size distribution in bytes for directory: .
Line 314 ⟶ 1,418:
77843 total files.</pre>

=={{header|Phix}}==
Works on Windows and Linux. Uses binary size units, i.e. 1MB == 1024KB. The first run can be quite slow, but the second and subsequent runs are much faster, once the OS has cached its (low-level) directory reads.
<syntaxhighlight lang="phix">
without js -- file i/o
sequence sizes = {1},
         res = {0}
atom t1 = time()+1

function store_res(string filepath, sequence dir_entry)
    if not find('d', dir_entry[D_ATTRIBUTES]) then
        atom size = dir_entry[D_SIZE]
        integer sdx = 1
        while size>sizes[sdx] do
            if sdx=length(sizes) then
                sizes &= sizes[$]*iff(mod(length(sizes),3)?10:10.24)
                res &= 0
            end if
            sdx += 1
        end while
        res[sdx] += 1
        if time()>t1 then
            printf(1,"%,d files found\r",sum(res))
            t1 = time()+1
        end if
    end if
    return 0 -- keep going
end function
integer exit_code = walk_dir(".", store_res, true)

printf(1,"%,d files found\n",sum(res))
integer w = max(res)
--include builtins/pfile.e
for i=1 to length(res) do
    integer ri = res[i]
    string s = file_size_k(sizes[i], 5),
           p = repeat('*',floor(60*ri/w))
    printf(1,"files < %s: %s%,d\n",{s,p,ri})
end for
</syntaxhighlight>
{{out}}
<pre>
112,160 files found
files < 1: 333
files < 10: *911
files < 100: ******4,731
files < 1KB: ********************************24,332
files < 10KB: ************************************************************45,379
files < 100KB: *********************************25,299
files < 1MB: *************10,141
files < 10MB: *933
files < 100MB: 91
files < 1GB: 8
files < 10GB: 2
</pre>

=={{header|Python}}==
The distribution is stored in a '''collections.Counter''' object (like a dictionary that automatically yields 0 when a key is not found, which is useful when incrementing). Anything could be done with this object; here the number of files is printed for increasing sizes. No check is made during the directory walk: usually, safeguards would be needed, or the program will fail on any unreadable file or directory (depending on rights, or too deep paths, for instance). Links are skipped here, which should avoid cycles.
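The safeguards mentioned above could be sketched as follows. This is only an illustrative variant, not the entry's own code: the names <code>bucket</code> and <code>size_histogram</code> are hypothetical, and bucketing by the number of decimal digits (0 for empty files) is an assumption made here to avoid floating-point logarithms. Unreadable files and directories are silently ignored, and symbolic links are skipped.

```python
import os
from collections import Counter


def bucket(size):
    """Hypothetical helper: 0 for an empty file, else the number of decimal digits."""
    return 0 if size == 0 else len(str(size))


def size_histogram(root):
    """Count regular files under root per size bucket, tolerating access errors."""
    hist = Counter()
    # os.walk does not follow directory symlinks by default;
    # onerror ignores directories that cannot be listed.
    for dirpath, dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path):
                    continue  # skip links to files as well
                hist[bucket(os.path.getsize(path))] += 1
            except OSError:
                continue  # file vanished or is unreadable: ignore it
    return hist
```

Calling <code>sorted(size_histogram(".").items())</code> would then give the (bucket, count) pairs in increasing order of size.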
<syntaxhighlight lang="python">import sys, os
from collections import Counter
Line 398 ⟶ 1,499:
    for dir in arg:
        dodir(dir)

    s = n = 0
    for k, v in sorted(h.items()):
Line 406 ⟶ 1,507:
    print("Total %d bytes for %d files" % (s, n))

main(sys.argv[1:])</syntaxhighlight>

=={{header|Racket}}==
<syntaxhighlight lang="racket">#lang racket

(define (file-size-distribution (d (current-directory)) #:size-group-function (sgf values))
Line 437 ⟶ 1,538:
(module+ test
  (call-with-values (λ () (file-size-distribution #:size-group-function log10-or-so))
                    (report-fsd log10-or-so)))</syntaxhighlight>
{{out}}
Line 449 ⟶ 1,550:
log10-or-so(size): 6.0 -> 6 files
Total: 10210127 bytes in 733 files</pre>

=={{header|Raku}}==
(formerly Perl 6)
{{works with|Rakudo|2017.05}}
By default, processes the current directory and all readable sub-directories; alternatively, pass a directory path on the command line.

<syntaxhighlight lang="raku" line>sub MAIN($dir = '.') {
    sub log10 (Int $s) { $s ?? $s.log(10).Int !! 0 }
    my %fsize;
    my @dirs = $dir.IO;
    while @dirs {
        for @dirs.pop.dir -> $path {
            %fsize{$path.s.&log10}++ if $path.f;
            @dirs.push: $path if $path.d and $path.r
        }
    }
    my $max = %fsize.values.max;
    my $bar-size = 80;
    say "File size distribution in bytes for directory: $dir\n";
    for 0 .. %fsize.keys.max {
        say sprintf( "# Files @ %5sb %8s: ", $_ ?? "10e{$_-1}" !!
            0, %fsize{$_} // 0 ),
            histogram( $max, %fsize{$_} // 0, $bar-size )
    }
    say %fsize.values.sum, ' total files.';
}

sub histogram ($max, $value, $width = 60) {
    my @blocks = <| ▏ ▎ ▍ ▌ ▋ ▊ ▉ █>;
    my $scaled = ($value * $width / $max).Int;
    my ($end, $bar) = $scaled.polymod(8);
    (@blocks[8] x $bar * 8) ~ (@blocks[$end] if $end) ~ "\n"
}</syntaxhighlight>
{{out}}
<pre>File size distribution in bytes for directory: /home

# Files @ 0b 989: ▏
# Files @ 10e0b 6655: ████████
# Files @ 10e1b 31776: ████████████████████████████████████████
# Files @ 10e2b 63165: ████████████████████████████████████████████████████████████████████████████████
# Files @ 10e3b 19874: ████████████████████████▏
# Files @ 10e4b 7730: ████████▏
# Files @ 10e5b 3418: ▌
# Files @ 10e6b 1378: ▏
# Files @ 10e7b 199:
# Files @ 10e8b 45:
135229 total files.</pre>

=={{header|REXX}}==
This REXX version works on Microsoft Windows using the '''dir''' subcommand; extra code was added for older versions of Windows that used suffixes to express big numbers (the size of a file), and also for versions that used mixed case in the output text.

Also, some Windows versions of the '''dir''' command insert commas into numbers, so code was added to elide them.
<syntaxhighlight lang="rexx">/*REXX program displays a histogram of filesize distribution of a directory structure(s)*/
numeric digits 30                                /*ensure enough decimal digits for a #.*/
parse arg ds .                                   /*obtain optional argument from the CL.*/
Line 468 ⟶ 1,628:
  do while lines(work)\==0;  _= linein(work)     /*process the data in the DIR work file*/
  if left(_, 1)==' '  then iterate               /*Is the record not legitimate?  Skip. */
  parse upper var _  .  .  sz  .                 /*uppercase the suffix  (if any).      */
  sz= space( translate(sz, , ','), 0)            /*remove any commas if present in the #*/

  if \datatype(sz,'W')  then do;  #= left(sz, length(sz) - 1)       /*SZ has a suffix?*/
                                  if \datatype(#,'N')  then iterate /*Meat ¬ numeric? */
Line 497 ⟶ 1,659:
exit                                             /*stick a fork in it,  we're all done. */
/*──────────────────────────────────────────────────────────────────────────────────────*/
commas: parse arg _;  do j#=length(_)-3  to 1  by -3; _=insert(',', _, j#); end;  return _</syntaxhighlight>
This REXX program makes use of the '''LINESIZE''' REXX program (or BIF), which determines the screen width (or linesize) of the terminal (console) so as to maximize the width of the histogram.

The '''LINESIZE.REX''' REXX program is included here ──► [[LINESIZE.REX]].<br>
Line 556 ⟶ 1,718:
20,156 files detected, 1,569,799,557 total bytes.
</pre>

=={{header|Rust}}==
If no target is specified, the program searches and reports on the directory the .exe is in.
{{libheader|walkdir}}
{{works with|Rust|2018}}
<syntaxhighlight lang="rust">
use std::error::Error;
use std::marker::PhantomData;
use std::path::{Path, PathBuf};
use std::{env, fmt, io, time};
use walkdir::{DirEntry, WalkDir};

fn main() -> Result<(), Box<dyn Error>> {
    let start = time::Instant::now();
    let args: Vec<String> = env::args().collect();

    let root = parse_path(&args).expect("not a valid path");
    let dir = WalkDir::new(&root);

    let (files, dirs): (Vec<PathBuf>, Vec<PathBuf>) = {
        let pool = pool(dir).expect("unable to retrieve entries from WalkDir");
        partition_from(pool).expect("unable to partition files from directories")
    };

    let (fs_count, dr_count) = (files.len(), dirs.len());
    let (file_counter, total_size) = file_count(files);

    {
        println!("++ File size distribution for : {} ++\n", &root.display());
        println!("Files @ 0B : {:4}", file_counter[0]);
        println!("Files > 1B - 1,023B : {:4}", file_counter[1]);
        println!("Files > 1KB - 1,023KB : {:4}", file_counter[2]);
        println!("Files > 1MB - 1,023MB : {:4}",
            file_counter[3]);
        println!("Files > 1GB - 1,023GB : {:4}", file_counter[4]);
        println!("Files > 1TB+ : {:4}\n", file_counter[5]);

        println!("Files encountered: {}", fs_count);
        println!("Directories traversed: {}", dr_count);
        println!(
            "Total size of all files: {}\n",
            Filesize::<Kilobytes>::from(total_size)
        );
    }

    let end = time::Instant::now();
    println!("Run time: {:?}\n", end.duration_since(start));
    Ok(())
}

fn parse_path(args: &[String]) -> Result<&Path, io::Error> {
    // If there's no `args` entered, the executable will search its own path.
    match args.len() {
        1 => Ok(Path::new(&args[0])),
        _ => Ok(Path::new(&args[1])),
    }
}

fn pool(dir: WalkDir) -> Result<Vec<DirEntry>, Box<dyn Error>> {
    // Check each item for errors and drop possible invalid `DirEntry`s
    Ok(dir.into_iter().filter_map(|e| e.ok()).collect())
}

fn partition_from(pool: Vec<DirEntry>) -> Result<(Vec<PathBuf>, Vec<PathBuf>), Box<dyn Error>> {
    // Read `Path` from `DirEntry`, checking if `Path` is a file or directory.
    Ok(pool
        .into_iter()
        .map(|e| e.into_path())
        .partition(|path| path.is_file()))
}

fn file_count(files: Vec<PathBuf>) -> ([u64; 6], u64) {
    let mut counter: [u64; 6] = [0; 6];
    for file in &files {
        match Filesize::<Bytes>::from(file).bytes {
            0 => counter[0] += 1,                                 // Empty file
            1..=1_023 => counter[1] += 1,                         // 1 byte to 0.99KB
            1_024..=1_048_575 => counter[2] += 1,                 // 1 kilo to 0.99MB
            1_048_576..=1_073_741_823 => counter[3] += 1,         // 1 mega to 0.99GB
            1_073_741_824..=1_099_511_627_775 => counter[4] += 1, // 1 giga to 0.99TB
            1_099_511_627_776..=std::u64::MAX => counter[5] += 1, // 1 terabyte or larger
        }
    }

    let total_file_size = files
        .iter()
        .fold(0, |acc, file| acc + Filesize::<Bytes>::from(file).bytes);
    (counter, total_file_size)
}

trait SizeUnit: Copy {
    fn singular_name() -> String;
    fn num_byte_in_unit() -> u64;
}

#[derive(Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Bytes;
impl SizeUnit for Bytes {
    fn singular_name() -> String {
        "B".to_string()
    }
    fn num_byte_in_unit() -> u64 {
        1
    }
}

#[derive(Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Kilobytes;
impl SizeUnit for Kilobytes {
    fn singular_name() -> String {
        "KB".to_string()
    }
    fn num_byte_in_unit() -> u64 {
        1_024
    }
}

#[derive(Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Filesize<T: SizeUnit> {
    bytes: u64,
    unit: PhantomData<T>,
}

impl<T> From<u64> for Filesize<T>
where
    T: SizeUnit,
{
    fn from(n: u64) -> Self {
        Filesize {
            bytes: n * T::num_byte_in_unit(),
            unit: PhantomData,
        }
    }
}

impl<T> From<Filesize<T>> for u64
where
    T: SizeUnit,
{
    fn from(fsz: Filesize<T>) -> u64 {
        ((fsz.bytes as f64) / (T::num_byte_in_unit() as f64)) as u64
    }
}

impl<T> fmt::Display for Filesize<T>
where
    T: SizeUnit,
{
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        // convert value in associated units to float
        let size_val = ((self.bytes as f64) / (T::num_byte_in_unit() as f64)) as u64;

        // plural?
        let name_plural = match size_val {
            1 => "",
            _ => "s",
        };

        write!(
            f,
            "{} {}{}",
            (self.bytes as f64) / (T::num_byte_in_unit() as f64),
            T::singular_name(),
            name_plural
        )
    }
}

// Can be expanded for From<File>, or any type that has an alias for Metadata
impl<T> From<&PathBuf> for Filesize<T>
where
    T: SizeUnit,
{
    fn from(f: &PathBuf) -> Self {
        Filesize {
            bytes: f
                .metadata()
                .expect("error with metadata from pathbuf into filesize")
                .len(),
            unit: PhantomData,
        }
    }
}
</syntaxhighlight>
{{out}}
<pre>
++ File size distribution for : .\Documents ++

Files @ 0B : 956
Files > 1B - 1,023B : 3724
Files > 1KB - 1,023KB : 4511
Files > 1MB - 1,023MB : 930
Files > 1GB - 1,023GB : 0
Files > 1TB+ : 0

Files encountered: 10121
Directories traversed: 2057
Total size of all files: 5264133277 KBs

Run time: 1.5671626s
</pre>

=={{header|Sidef}}==
<syntaxhighlight lang="ruby">func traverse(Block callback, Dir dir) {
    dir.open(\var dir_h) || return nil
Line 587 ⟶ 1,949:
}

say "Total: #{total_size} bytes in #{files_num} files"</syntaxhighlight>
{{out}}
<pre>
Line 600 ⟶ 1,962:
log10(size) ~~ 8 -> 2 files
Total:
370026462 bytes in 2650 files
</pre>

=={{header|Tcl}}==
This uses the '''fileutil::traverse''' package from Tcllib to do the tree walking; a '''glob'''-based alternative that ignores links but not hidden files is possible, but would add about a dozen lines.
<syntaxhighlight lang="tcl">package require fileutil::traverse
namespace path {::tcl::mathfunc ::tcl::mathop}

# Ternary helper
proc ? {test a b} {tailcall if $test [list subst $a] [list subst $b]}

set dir [? {$argc} {[lindex $argv 0]} .]
fileutil::traverse Tobj $dir \
    -prefilter {apply {path {ne [file type $path] link}}} \
    -filter {apply {path {eq [file type $path] file}}}
Tobj foreach path {
    set size [file size $path]
    dict incr hist [? {$size} {[int [log10 $size]]} -1]
}
Tobj destroy

foreach key [lsort -int [dict keys $hist]] {
    puts "[? {$key == -1} 0 {1e$key}]\t[dict get $hist $key]"
}</syntaxhighlight>
{{out}}
<pre>0 1
1e1 339
1e2 3142
1e3 2015
1e4 150
1e5 29
1e6 13
1e7 3</pre>

=={{header|UNIX Shell}}==
{{works with|Bourne Shell}}
Uses POSIX-conformant code unless the environment variable GNU is set to a non-empty value.
<syntaxhighlight lang="sh">#!/bin/sh
set -eu

tabs -8

if [ ${GNU:-} ]
then
    find -- "${1:-.}" -type f -exec du -b -- {} +
else
    # Use a subshell to remove the last "total" line per each ARG_MAX
    find -- "${1:-.}" -type f -exec sh -c 'wc -c -- "$@" | sed \$d' argv0 {} +
fi | awk -vOFS='\t' '
    BEGIN {split("KB MB GB TB PB", u); u[0] = "B"}
    {
        ++hist[$1 ? length($1) - 1 : -1]
        total += $1
    }
    END {
        max = -2
        for (i in hist)
            max = (i > max ? i : max)

        print "From", "To", "Count\n"
        for (i = -1; i <= max; ++i)
        {
            if (i in hist)
            {
                if (i == -1)
                    print "0B", "0B", hist[i]
                else
                    print 10 ** (i % 3) u[int(i / 3)], 10 ** ((i + 1) % 3) u[int((i + 1) / 3)], hist[i]
            }
        }
        l = length(total) - 1
        printf "\nTotal: %.1f %s in %d files\n", total / (10 ** l), u[int(l / 3)], NR
    }'</syntaxhighlight>
{{out}}
<pre>$ time ~/fsd.sh
From To Count

0B 0B 13
1B 10B 74
10B 100B 269
100B 1KB 5894
1KB 10KB 12727
10KB 100KB 12755
100KB 1MB 110922
1MB 10MB 50019
10MB 100MB 17706
100MB 1GB 5056
1GB 10GB 1139
10GB 100GB 141
100GB 1TB 1

Total: 8.9 TB in 216716 files
~/fsd.sh 1.28s user 2.55s system 134% cpu 2.842 total
$ time GNU=1 ~/fsd.sh
From To Count

0B 0B 13
1B 10B 74
10B 100B 269
100B 1KB 5894
1KB 10KB 12727
10KB 100KB 12755
100KB 1MB 110922
1MB 10MB 50019
10MB 100MB 17706
100MB 1GB 5056
1GB 10GB 1139
10GB 100GB 141
100GB 1TB 1

Total: 8.9 TB in 216716 files
GNU=1 ~/fsd.sh 0.81s user 1.33s system 135% cpu 1.586 total</pre>

=={{header|Wren}}==
{{libheader|Wren-math}}
{{libheader|Wren-fmt}}
<syntaxhighlight lang="wren">import "io" for Directory, File, Stat
import "os" for Process
import "./math" for Math
import "./fmt" for Fmt

var sizes = List.filled(12, 0)
var totalSize = 0
var numFiles = 0
var numDirs = 0

var fileSizeDist // recursive function
fileSizeDist = Fn.new { |path|
    var files = Directory.list(path)
    for (file in files) {
        var path2 = "%(path)/%(file)"
        var stat = Stat.path(path2)
        if (stat.isFile) {
            numFiles = numFiles + 1
            var size = stat.size
            if (size == 0) {
                sizes[0] = sizes[0] + 1
            } else {
                totalSize = totalSize + size
                var logSize = Math.log10(size)
                var index = logSize.floor + 1
                sizes[index] = sizes[index] + 1
            }
        } else if (stat.isDirectory) {
            numDirs = numDirs + 1
            fileSizeDist.call(path2)
        }
    }
}

var args = Process.arguments
var path = (args.count == 0) ?
"./" : args[0] if (!Directory.exists(path)) Fiber.abort("Path does not exist or is not a directory.") fileSizeDist.call(path) System.print("File size distribution for '%(path)' :-\n") for (i in 0...sizes.count) { System.write((i == 0) ? " " : "+ ") Fmt.print("Files less than 10 ^ $-2d bytes : $,5d", i, sizes[i]) } System.print(" -----") Fmt.print("= Number of files : $,5d", numFiles) Fmt.print(" Total size in bytes : $,d", totalSize) Fmt.print(" Number of sub-directories : $,5d", numDirs)</syntaxhighlight> {{out}} <pre> File size distribution for './' :- Files less than 10 ^ 0 bytes : 4 + Files less than 10 ^ 1 bytes : 2 + Files less than 10 ^ 2 bytes : 135 + Files less than 10 ^ 3 bytes : 946 + Files less than 10 ^ 4 bytes : 746 + Files less than 10 ^ 5 bytes : 79 + Files less than 10 ^ 6 bytes : 11 + Files less than 10 ^ 7 bytes : 3 + Files less than 10 ^ 8 bytes : 0 + Files less than 10 ^ 9 bytes : 0 + Files less than 10 ^ 10 bytes : 0 + Files less than 10 ^ 11 bytes : 0 ----- = Number of files : 1,926 Total size in bytes : 12,683,455 Number of sub-directories : 3 </pre> =={{header\|zkl}}== <~~lang~~syntaxhighlight lang="zkl">pipe:=Thread.Pipe(); // hoover all files in tree, don't return directories fcn(pipe,dir){ File.globular(dir,"",True,8,pipe); } Line 621 ⟶ 2,169: println("%15s : %s".fmt(szchrs[idx,], ""(scalecnt).round().toInt())); idx-=1 + comma(); }</~~lang~~syntaxhighlight> {{out}} <pre> Line 638 ⟶ 2,186: Found 4320 files, 67,627,849,052 bytes, 15,654,594 mean. File size Number of files ( = 69.84) n : nn : nnn : n,nnn : * nn,nnn : nnn,nnn : n,nnn,nnn : * nn,nnn,nnn : **************************************************