Need to do a multi-key sort on a legacy text file that has no delimiters

I need to do a multi-key sort on a legacy text file that has no delimiters.

This is a relatively simple task on Unix, but the Windows command shell cannot sort on multiple keys - for example, columns 5 thru 10 and 20 thru 30.

I decided to use PowerShell to overcome this limitation, but I am still struggling to find the answer.

So far I have attempted this, but it does not work:

$data = Get-Content  C:\data\test.txt | Sort-Object @{Expression={$_[16] - $_[59]},{$_[84] - $_[87]};Ascending=$true}

The data is like this - one long string:

CVMONLINREG_CUSTTC000000000000000000000003TC00000000000000032012040313202700000000800719CVMONLINTC000000000000000000000003

The shell hangs on running this and I have to kill the window.

Any suggestions are welcome at this point.

Thank you in advance

  • So how do you want it sorted?  I am assuming column widths are fixed.  And can you post at least five lines of data?  And if there are column names by which you refer to the data, what are they?

  • Here is the data:

    Columns 17 thru 74 and 84 thru 88

    CVMONLIN        TC000000000000000000000002TC00000000000000022012040313202600000000700000       
    CVMONLIN        TC000000000000000000000002TC00000000000000022012040313202600000000600001CVMONLIN
    CVMONLIN        TC000000000000000000000002TC00000000000000022012040313202600000000800002CVMONLIN
    CVMONLINREG_CUSTTC000000000000000000000002TC00000000000000022012040313202600000000800003CVMONLIN
    CVMONLINREG_CUSTTC000000000000000000000002TC00000000000000022012040313202600000000800004CVMONLIN
    CVMONLIN        TC000000000000000000000001TC00000000000000012012040313202500000000700000       
    CVMONLIN        TC000000000000000000000001TC00000000000000012012040313202500000000600001CVMONLIN
    CVMONLIN        TC000000000000000000000001TC00000000000000012012040313202500000000800002CVMONLIN
    CVMONLINREG_CUSTTC000000000000000000000001TC00000000000000012012040313202500000000800003CVMONLIN

    and it should look like this - the last 4 records need to be moved to the top, as their key is TC000000000000000000000001TC000000000000000120120403132025:

    CVMONLIN        TC000000000000000000000001TC00000000000000012012040313202500000000700000       
    CVMONLIN        TC000000000000000000000001TC00000000000000012012040313202500000000600001CVMONLIN
    CVMONLIN        TC000000000000000000000001TC00000000000000012012040313202500000000800002CVMONLIN
    CVMONLINREG_CUSTTC000000000000000000000001TC00000000000000012012040313202500000000800003CVMONLIN
    CVMONLIN        TC000000000000000000000002TC00000000000000022012040313202600000000700000       
    CVMONLIN        TC000000000000000000000002TC00000000000000022012040313202600000000600001CVMONLIN
    CVMONLIN        TC000000000000000000000002TC00000000000000022012040313202600000000800002CVMONLIN
    CVMONLINREG_CUSTTC000000000000000000000002TC00000000000000022012040313202600000000800003CVMONLIN
    CVMONLINREG_CUSTTC000000000000000000000002TC00000000000000022012040313202600000000800004CVMONLIN

  • Based on what you told me about the columns (Column 17 thru 74 and 84 thru 88) this script would seem to do the trick.  However, it doesn't seem to produce the same output you have above.  For instance, lines 1 & 2 are reversed.  Mine looks correct according to your description though.  So let me know what I am missing.  My assumption was that you wanted the same output, sorted on columns 17-74 and then columns 84-88.

    $pattern = "^.{16}(?'field1'.{57}).{8}(?'field2'.{4})"   # two named captures mark the sort columns
    $collection = @()
    cat .\indata.txt | foreach {
        [void]($PSItem -match $pattern)   # populates $Matches; [void] discards the True/False result
        $dataProperties = @{
            Data = $PSItem;               # the original record, kept intact for output
            SortField1 = $Matches['field1'];
            SortField2 = $Matches['field2']
        }
        $dataProperties                   # demo only: echoes the properties to the console
        $collection += New-Object -TypeName PSObject -Property $dataProperties
    }
    $collection | sort -Property SortField1, SortField2 | select -ExpandProperty Data
  • Wow - that worked awesome.  I cannot thank you enough.  I sort of understand the logic but not the syntax - I am new to PowerShell.  Of course, that requires some learning on my part, but thank you for giving me a head start on this.

    I suppressed $dataProperties from going to the console and piped the sorted data to the output file, as below.

    $pattern = "^.{16}(?'field1'.{58}).{9}(?'field2'.{5})"
    $collection = @()
    cat .\jtfauq10.txt | foreach {
        [void]($PSItem -match $pattern)
        $dataProperties = @{
            Data = $PSItem;
            SortField1 = $Matches['field1'];
            SortField2 = $Matches['field2']
        }
        $dataProperties  | out-null
        $collection += New-Object -TypeName PSObject -Property $dataProperties
    }
    $collection | sort -Property SortField1, SortField2 | select -ExpandProperty Data | Out-File $outFile

  • Dennis Soans said:
    $pattern = "^.{16}(?'field1'.{58}).{9}(?'field2'.{5})"

    The real magic is courtesy of RegEx (regular expressions), which are available in several languages.  The $pattern string is evaluated by the -match operator.  Since all you had was raw text with no delimiters or readily identifiable markers, AND you had fixed column sizes, I did simple matches based on column counts.  The '.{16}' says match exactly 16 characters of any type.  The '.{58}' says match exactly the next 58 characters.  By wrapping that as '(?'field1'.{58})' it becomes a named capture.  That way I can refer to it later by name as $Matches['field1'].

    I tried to keep it as simple as possible.  If there is anything else you have questions about, let me know.
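
    To make the named-capture idea concrete, here is a minimal sketch you can paste straight into a console - the sample record is taken from your data above, and the pattern is your corrected version:

    # One sample record; the pattern names two fixed-width slices of it.
    $sample  = 'CVMONLINREG_CUSTTC000000000000000000000001TC00000000000000012012040313202500000000800003CVMONLIN'
    $pattern = "^.{16}(?'field1'.{58}).{9}(?'field2'.{5})"

    if ($sample -match $pattern) {
        $Matches['field1']   # columns 17-74: TC000000000000000000000001TC000000000000000120120403132025
        $Matches['field2']   # columns 84-88: 00003
    }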

  • Dennis Soans said:
    $dataProperties  | out-null

    You can delete that line altogether.  In the demo code it was used to show the properties that were going to go into the object.

    Did I miss something in your code?  Where do you actually set the $outFile variable?

  • The last line/word in the code snippet!

    One question though - what implications would this code have on memory utilization as the input file gets very large in a production environment?  I would think that to do a sort, reading in all the data first is a prerequisite.  I am not sure if there is a way to break the workload into smaller units, unless the input data can be split into smaller logical units that are sorted separately and then merged in a final sort over the whole data set.

    I am also not sure what memory impacts/issues come with the use of objects, as in PowerShell, compared to other shell languages that do not use them.

  • I see where you send the output to $outfile.  However, I don't see in the script where you set the value of the $outfile variable.

    In terms of memory utilization, it's something you would have to baseline.  For instance, I don't read in the entire input file all at once.  I process it line-by-line in the pipeline.  For each line I create a PSObject that has the sort keys (Field1 and Field2).  But then I have to collect all the objects for later sorting and output.  And PowerShell is not very efficient about appending to arrays.
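
    (As an aside - just a sketch, not something benchmarked here - one way around the append cost is to collect into a .NET ArrayList, whose Add() grows the list in place instead of rebuilding the array on every append:)

    # Sketch: same logic as above, but collecting into an ArrayList.
    $pattern = "^.{16}(?'field1'.{58}).{9}(?'field2'.{5})"
    $collection = New-Object System.Collections.ArrayList
    cat .\indata.txt | foreach {
        [void]($PSItem -match $pattern)
        [void]$collection.Add((New-Object -TypeName PSObject -Property @{
            Data       = $PSItem
            SortField1 = $Matches['field1']
            SortField2 = $Matches['field2']
        }))   # [void] discards the index that Add() returns
    }
    $collection | sort -Property SortField1, SortField2 | select -ExpandProperty Data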

    Not knowing what "very large" means in your environment, it's hard to guess what the impact might be.  You can either watch it in Task Manager or run a PerfMon.  If the data comes from a production server that is already memory constrained, you may want to consider processing it on another system, or process it during off-peak hours.  Totally depends on your environment.
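
    (For a rough spot-check from inside the session - it won't replace Task Manager or PerfMon - you could sample the shell's own working set before and after the run; a sketch:)

    # Rough spot-check of the current PowerShell process's working set.
    $before = (Get-Process -Id $PID).WorkingSet64
    # ... run the sort here ...
    $after = (Get-Process -Id $PID).WorkingSet64
    "Working set grew by {0:N1} MB" -f (($after - $before) / 1MB)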

    I also don't know what the ultimate consumer is of the sorted data.  It's obviously not very human friendly, so I assume it's some other process.  Breaking it into smaller "chunks" might not be that useful, in that you would then still need to merge the subordinate chunks to make it useful to the downstream consumer.


  • Bob, hope you don't mind if I chime in.  As Bob mentioned, performance isn't PowerShell's forte, and appending to an array is ugly.  It's the old "copy the entire array to a new array and put the new value in it" routine.  Clearly, the bigger the array gets, the bigger this kind of performance hit becomes.

    I ran some tests a while back and noticed that PowerShell 3.0 is much better than 2.0, but it's still an issue.  For most things we do as scripters the performance hit isn't enough for us to do anything about it, but occasionally it does come up - like now.
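
    A quick way to see the copy happen (just an illustration):

    $a = 1, 2, 3
    $b = $a                               # $b points at the same array
    $a += 4                               # += allocates a NEW array and copies the old one in
    [object]::ReferenceEquals($a, $b)     # False - $a is no longer the original array
    $b.Count                              # 3 - the original array was never touched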

    Here's a technique I picked up from the Scripting Guy:


    $pattern = "^.{16}(?'field1'.{57}).{8}(?'field2'.{4})"
    $collection = cat c:\Test\indata.txt | foreach {
        [void]($PSItem -match $pattern)
        [PSCustomObject]@{
            Data = $PSItem
            SortField1 = $Matches['field1']
            SortField2 = $Matches['field2']
        }
    } 
    $collection | sort -Property SortField1, SortField2 | select -ExpandProperty Data

    I put this in a loop with the other method, ran each through the loop 10 times, and took the average.  Here's what I came up with:

    Array append method: 7.74 milliseconds
    Direct assignment method (above): 4.47 milliseconds

    And this is over a VERY small dataset.  As the dataset gets bigger, this will translate into more and more performance gain.  Might be worth considering :)
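
    (The exact harness isn't shown here, but one way to reproduce the measurement - a sketch, with the iteration count as a placeholder - is Measure-Command in a loop:)

    # Sketch: time a script block 10 times and average the elapsed milliseconds.
    $runs = foreach ($i in 1..10) {
        (Measure-Command {
            # ... paste one of the methods above in here ...
        }).TotalMilliseconds
    }
    ($runs | Measure-Object -Average).Average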


  • Nice.  I like the direct assignment.  However, while it runs faster (and will thus be painful for a shorter period of time), you really don't get a clue about memory consumption.

  • Bob McCoy said:

    Nice.  I like the direct assignment.  However, while it runs faster (and will thus be painful for a shorter period of time), you really don't get a clue about memory consumption.

    True, but can it be any worse than array appending?  Especially on a large dataset?  I wonder if there would be any benefit to putting the PSCustomObject in a hash table (with some bogus key), then pulling it back out with the Values property?

  • Actually, it was a smidge faster:

    Test1      Test2      Test3
    -----      -----      -----
    5.69063    7.2064     5.50992

    Test1 = Direct Assignment
    Test2 = Array Append
    Test3 = Hashtable

    (All times are in milliseconds.)

    Here's the hashtable code:


    $collection = @{}
    $Index = 0
    $pattern = "^.{16}(?'field1'.{57}).{8}(?'field2'.{4})"
    cat c:\Test\indata.txt | foreach {
        [void]($PSItem -match $pattern)
        # The running index is a throwaway key; only the Values are used later.
        $collection.Add($Index, [PSCustomObject]@{
            Data = $PSItem
            SortField1 = $Matches['field1']
            SortField2 = $Matches['field2']
        })
        $Index++
    }
    $collection.Values | sort -Property SortField1, SortField2 | select -ExpandProperty Data


  • Don't get me wrong.  I like the direct assignment.  But Simon was asking about memory consumption.  And you would have to measure that outside of PS as we don't really have the tools to measure that from the inside.

  • Bob McCoy said:

    Don't get me wrong.  I like the direct assignment.  But Simon was asking about memory consumption.  And you would have to measure that outside of PS as we don't really have the tools to measure that from the inside.

    No worries, I knew what you meant!

  • Martin you're a genius!  I got to noodling on the hash table comment.  But why a bogus key when you could generate the key based on the sort fields?  And then you just sort on the hash keys.  This is what I came up with ...

    $pattern = "^.{16}(?'field1'.{58}).{9}(?'field2'.{5})"
    $collection = @{}
    cat .\jtfauq10.txt | foreach {
        [void]($PSItem -match $pattern)
        # The two sort fields concatenated become the hash key itself.
        $collection[$($Matches['field1'] + $Matches['field2'])] = $PSItem
    }
    $collection.GetEnumerator() | sort Name # | Out-File $outFile

    I love PowerShell!  There are all sorts of ways to slice and dice this issue.

    Note:  For the demo script I've commented out Dennis's Out-File on the last line.
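
    One caveat for anyone reusing this: the enumerator emits Name/Value pairs, so to write the raw records back out you would expand the Value first - a sketch, assuming $outFile is set (and note that two records with an identical composite key would overwrite each other in the hash table):

    $collection.GetEnumerator() | sort Name |
        select -ExpandProperty Value | Out-File $outFile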