I found a thread regarding my question (shell - Different versions of UNIX sort handle case differently), but it gives the "opposite" answer so to speak.
I've messed about with the LANG variable, but can't seem to find a value that achieves my goal.
abc a
Abc d
Abc b
abc e
abæ g
Needs to be sorted to:
abc a
abc c
Abc b
Abc d
abæ g
Not this (which is what I currently get):
Abc b
Abc d
abc a
abc c
abæ g
And not this either (which is what I get when I sort case insensitive):
abc a
Abc b
abc c
Abc d
abæ g
In other words: I want a case-sensitive sort per column, where words starting with an uppercase letter are not sorted to the top, and where the uppercase and lowercase versions of the same word are not interleaved depending on the second column.
Notice that I need a UTF-8-aware sort (in this case I used the Danish letter "æ", which is placed in the alphabet like this: "...vwxyzæøå").
I am sorting on two columns using:
sort test.txt -k1,1 -k2,2
Any way I can do this without resorting to a script?
asked Nov 8, 2011 at 14:11
You don't want things of mixed case in the first column mixed together depending on what the second column has, but that is exactly what a case insensitive sort gives you. It considers things that share a casefold to be identical.
The sort of this set of Unicode records:
abc a
Abc d
Abc b
abc e
abæ g
is of course this:
abæ g
abc a
Abc b
Abc d
abc e
That's because the first and second letters are each “the same” (i.e., their casefolds are identical) in all five lines, so the first different letter is the third, which being an æ of course comes before c, which is what the other four records have as their third letter.
With the remaining lines, they all have the same first three letters, so it is their fourth letter that is dispositive, giving now the sequence a, b, d, e. Spaces do not (normally) matter in a Unicode sort, because it is an alphanumeric sort not a code point sort. We only consider letters here unless they are identical all the way down to case, and only then are other code points considered.
That’s just how sorting Unicode works.
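To make the "letters first, case only as a tie-breaker" rule concrete, here is a minimal sketch using Unicode::Collate. The `describe` helper is a hypothetical name introduced only for this illustration; `level`, `eq`, and `cmp` are the module's own.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Unicode::Collate;

# Level 1 compares base letters only, so case differences are invisible.
my $primary = Unicode::Collate->new(level => 1);

# The default collator compares every level, so case breaks ties.
my $full = Unicode::Collate->new();

# Hypothetical helper: report how two strings relate under each collator.
sub describe {
    my ($x, $y) = @_;
    my $letters = $primary->eq($x, $y) ? "same letters" : "different letters";
    my $order   = $full->cmp($x, $y);    # -1, 0, or 1
    return "'$x' vs '$y': $letters, full comparison gives $order";
}

print describe("abc",   "Abc"),   "\n";  # differ only in case
print describe("Abc b", "abc e"), "\n";  # decided by the letter after the space
```

The second call shows why "Abc b" lands between "abc a" and "abc e" when whole lines are compared: the b-versus-e difference is a letter difference, and that outranks the case difference earlier in the line.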
The Unicode Collation Algorithm does not pay attention to Danish ordering unless you ask it to. The default DUCET entries for those code points put things like æ and å next to a, and ø next to o. The OED sorts these entries in this order:
allergist
allergy
Allerød
allers
allethrin
That's because the o in "Allerød" follows the g in "allergy" and precedes the s in allers. Diacritics only matter if everything else is the same, so a hypothetical "alleroc" would precede "Allerød" and a hypothetical "allerog" would follow it but precede "allers".
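As a rough check of that ordering, the following sketch feeds the same five dictionary entries, shuffled, to a default Unicode::Collate object. It assumes nothing beyond the core module; the variable names are illustrative only.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                  # this source contains a literal ø
use open qw(:std :utf8);   # emit UTF-8 on STDOUT
use Unicode::Collate;

# The five entries from the text, deliberately out of order.
my @words = ("allers", "Allerød", "allethrin", "allergy", "allergist");

# Default collator: DUCET ordering with no locale tailoring.
my $oed_collator = Unicode::Collate->new();
my @oed_sorted   = $oed_collator->sort(@words);

print "$_\n" for @oed_sorted;
```

Every pair here is decided by a letter difference (g before o, o before s, r before t), so neither the capital A nor the stroke on the ø gets a say.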
That's just how sorting works in Unicode. Scandinavians hate it because they think it should just do whatever their idiosyncratic national systems do, but Unicode is not biased toward a particular language. If you want your idiotsyncrasies, you have to use locale sorting. To get a Danish locale-specific sort like this:
abc a
Abc b
Abc d
abc e
abæ g
You need to run your sort with a Danish locale specified, not in the broken POSIX way, but in the Unicode way.
First, you must give up on trying to use sort(1). It’s worse than useless: it’s unreliable and deceptive. If you have Unicode data, you should be using a Unicode sort, whether unmodified as the OED does or modified for your little village.
To produce the normal Unicode ordering, you must use:
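#!/usr/bin/env perl
use strict;
use warnings;
use open qw(:std :utf8);   # UTF-8 on STDIN, STDOUT, and STDERR
use utf8;                  # this source file contains UTF-8 literals
use Unicode::Collate;

# Pull each record out of the here-document, dropping leading whitespace.
my @lines = <<'End_of_Lines' =~ /\S.*\S\n/g;
    abc a
    Abc d
    Abc b
    abc e
    abæ g
End_of_Lines

# Default collator: plain DUCET ordering, no locale tailoring.
my $collator = Unicode::Collate->new();
print $collator->sort(@lines);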
While to get the locale-restricted non-default just-for-you sort, you need:
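#!/usr/bin/env perl
use strict;
use warnings;
use open qw(:std :utf8);   # UTF-8 on STDIN, STDOUT, and STDERR
use utf8;                  # this source file contains UTF-8 literals
use Unicode::Collate::Locale;

# Pull each record out of the here-document, dropping leading whitespace.
my @lines = <<'End_of_Lines' =~ /\S.*\S\n/g;
    abc a
    Abc d
    Abc b
    abc e
    abæ g
End_of_Lines

# Danish tailoring: æ, ø, and å sort after z.
my $collator = Unicode::Collate::Locale->new(locale => "da");
print $collator->sort(@lines);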
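Both scripts above compare whole lines, whereas the question sorts on two columns (`-k1,1 -k2,2`). One possible way to get per-column behaviour is to split each record and compare the fields one at a time with the collator's `cmp` method. This is only a sketch under stated assumptions: every record has exactly two whitespace-separated columns, `sort_by_columns` is a hypothetical helper name, and `upper_before_lower => 0` is passed explicitly so that lowercase sorts first regardless of what the locale tailoring would otherwise choose.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                  # this source contains a literal æ
use open qw(:std :utf8);   # emit UTF-8 on STDOUT
use Unicode::Collate::Locale;

# Assumption: lowercase-first is forced explicitly rather than left to the locale.
my $field_collator = Unicode::Collate::Locale->new(
    locale             => "da",
    upper_before_lower => 0,
);

# Hypothetical helper: sort two-column records on column 1, then column 2.
sub sort_by_columns {
    my @records = map { [ split ' ', $_ ] } @_;
    my @ordered = sort {
        $field_collator->cmp($a->[0], $b->[0])      # first key: column 1
            || $field_collator->cmp($a->[1], $b->[1])   # tie-breaker: column 2
    } @records;
    return map { join(' ', @$_) } @ordered;
}

my @records = ("abc a", "Abc d", "Abc b", "abc e", "abæ g");
print "$_\n" for sort_by_columns(@records);
```

Because the first column is compared in full (case included) before the second column is consulted, "abc" and "Abc" form separate groups instead of being interleaved by whatever follows them.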
The Unicode::Collate module has been included as standard since Perl release v5.8.
The Unicode::Collate::Locale module has been included as standard since Perl release v5.14, but it is trivially installable from CPAN on earlier releases:
$ sudo perl -MCPAN -e "install Unicode::Collate::Locale"
The reason you must use Perl for this is because you simply cannot trust vendor locales to work according to the Unicode Collation Algorithm, with or without locale modifications. I have never seen two different systems where they work the same way, which means that at least one of every pair is broken and perhaps both are. In contrast, you can guarantee that the UCA will always behave the same way no matter where you are. It doesn’t care what your Terminal can display. It doesn’t care about fonts. It doesn’t care if you’re redirected. It doesn’t care about which shell you’re running. It doesn’t care whether your Aunt Gertrude happens to run the code on the 5th Monday in a month. It just works, and it works the same way every time in every situation. Use the UCA. Accept no substitutes.
But just because you use the UCA doesn’t mean you need to accept the default ordering. The UCA was designed to be super-amenable to tailoring. If you want a locale sort, this is easy — and if there’s CLDR data for that locale, it is positively trivial. If you want to do a sort of book and movie titles, or of people’s names with the surname counting stronger than the forename and with all the Scottish Mc- and Mac- names sorting before M- but irrespective of each other, all these things are very very easy with the UCA. Anything you can imagine can be done, and usually with astonishing ease. The point is that with the UCA, you always start with a behavior that is guaranteed to work exactly the same way irrespective of platform or prejudice. That means you can rely on how it works when you want to apply your own customizations to it. Without that guarantee, all is lost.
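As one small illustration of tailoring, the sketch below sorts titles while ignoring a leading English article, using the `preprocess` parameter of Unicode::Collate. The title list and the article pattern are made up for this example and are not from the original answer.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Unicode::Collate;

# Tailored collator: strip a leading "A", "An", or "The" before comparing.
# The pattern is an illustrative assumption covering English articles only.
my $title_collator = Unicode::Collate->new(
    preprocess => sub {
        my $str = shift;
        $str =~ s/^(?:an?|the)\s+//i;
        return $str;
    },
);

my @titles        = ("The Matrix", "Zulu", "A Bug's Life", "Alien");
my @sorted_titles = $title_collator->sort(@titles);

# The original strings are printed; only the sort keys were altered.
print "$_\n" for @sorted_titles;
```

The preprocessing only affects how sort keys are built, so the output keeps each title intact while ordering it by its first significant word.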
You can get a pre-made command-line replacement (well, sort of) for the Unix sort(1) program which is UCA compliant here. It doesn’t do fields of course, but it does do quite a bit more.