Just Return Any Random Row

Sometime I see a developer try to run something like “select deptno,sal from emp group by deptno”. I usually ask the developer “But employees in the department have many different salaries. Which one do you want?”, and sometimes I get an amazing answer: “I don’t really care, I just want to see any random salary”. Usually, I tell her to aggregate by “max” or something similar, if she doesn’t care about the result.

But today I was really annoyed. So I wrote a custom aggregation function that will return a random salary.  I admit, it is not as useful as str_agg, but there seem to be few developers who are interested in this feature.

The main challenge was to make the aggregation truly random. When you aggregate, you always have the current aggregation value and a new one. I have to randomly choose one of them – but I can’t make it a simple 50/50 selection.

Suppose I have three rows. The way aggregation works, I first take two rows and flip a coin to pick one. Now I have a current value – and I have to take the third row and decide if I want to keep the current value or the new one. I can’t flip the coin again – because if the third row has 50% chance to be selected, this means the first and second rows only have 25% chance each. Not fair. So I need to give the third row 1/3 chance, and the current value 2/3.

create or replace type agg_t as object (

    curr_value number,
    running_count number,

    static function ODCIAggregateInitialize(sctx  in out agg_t)
                    return number,

    member function ODCIAggregateIterate   (self  in out agg_t,
                                            new_value in number)
                    return number,

    member function ODCIAggregateTerminate (self         in     agg_t   ,
                                            return_value    out number,
                                            flags        in number      )
                    return number,

    member function ODCIAggregateMerge(self in out agg_t,
                                       ctx2 in agg_t    )
                    return number
);
/

create or replace type body agg_t is 

    static function ODCIAggregateInitialize(sctx in out agg_t)
        return number is
    begin
        sctx := agg_t(null,0);
        return ODCIConst.Success;
    end;

    member function ODCIAggregateIterate(
      self in out agg_t, new_value in number)
        return number is
    begin
        running_count:=running_count+1;
        if (mod(dbms_random.random,running_count)=0) then
        	curr_value:=new_value;
        end if;
        return ODCIConst.Success;
    end;

    member function ODCIAggregateTerminate(self in agg_t,
        return_value out number, flags in number) return number is
    begin
        return_value := curr_value;
        return ODCIConst.Success;
    end;

    member function ODCIAggregateMerge(self in out agg_t,
        ctx2 in agg_t) return number is
    begin

    	if (mod(dbms_random.random,running_count+ctx2.running_count)<ctx2.running_count) then
    		curr_value:=ctx2.curr_value;
    	end if;
        running_count:=running_count+ctx2.running_count;
        return ODCIConst.Success;
    end;
end;
/

create or replace function agg_random (input number) return number
    parallel_enable aggregate using agg_t;
/

So easy! I love user defined aggregations! But you have to be careful when writing them. I accidentally replaced “number” with “varchar” somewhere in the code and got a lovely error message when I tried to run it:

SQL> select deptno,agg_random(sal) from emp group by deptno;
select deptno,agg_random(sal) from emp group by deptno
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 2104
Session ID: 138 Serial number: 1496

And in the alert log: Exception [type: ACCESS_VIOLATION, UNABLE_TO_WRITE] [ADDR:0x0] [PC:0x3D31168, ___intel_new_memcpy()+40]
I segfaulted! and you can imagine how fun it was to debug my pl/sql code using the trace file…

Anyway, after all the debugging is done, its time to show the code to my developer:

SQL> select deptno,agg_random(sal) from emp group by deptno;

    DEPTNO AGG_RANDOM(SAL)
---------- ---------------
        10            2450
        20            2975
        30             950

SQL> select deptno,agg_random(sal) from emp group by deptno;

    DEPTNO AGG_RANDOM(SAL)
---------- ---------------
        10            1300
        20            2975
        30            1600

SQL> select deptno,agg_random(sal) from emp group by deptno;

    DEPTNO AGG_RANDOM(SAL)
---------- ---------------
        10            2450
        20            2975
        30             950

SQL> select deptno,agg_random(sal) from emp group by deptno;

    DEPTNO AGG_RANDOM(SAL)
---------- ---------------
        10            2450
        20            3000
        30            2850

But I still wasn’t happy. What if I have a bug and the selection is not random enough? After all, the values in the example seems a bit repetitive.

No problem. I’ll use my function on a table containing numbers 1 to 100, pick few random numbers, and then use the well known chi-square test to check if my random selection matches the uniform distribution.

SQL> create table r1 as select level l from dual connect by level<100;

Table created.

SQL> create table r2 (n number);

Table created.

SQL> insert into r2 select agg_random(l) from r1 group by ();

1 row created.

-- repeated 20 times

SQL> select count(*) from r2;

  COUNT(*)
----------
        20
        
        
SQL> DECLARE
  2   sig   NUMBER;
  3   alpha NUMBER := 1;
  4   beta  NUMBER := 99;
  5   ttype VARCHAR2(20) := 'CHI_SQUARED';
  6  BEGIN
  7    dbms_stat_funcs.uniform_dist_fit('SCOTT', 'R2', 'N', 'DISCRETE', ttype, a
lpha, beta, sig);
  8
  9    dbms_output.put_line(sig);
 10  END;
 11  /
X-squared value : .9000000000000000000000000000000000000026 
Degree of freedom : 18.0000000013919621366679664

PL/SQL procedure successfully completed.

90% probability of getting my result assuming the function is good. I’m so happy. I knew that getting a degree in statistics was not a complete waste of time :)

About these ads

8 Comments on “Just Return Any Random Row”

  1. alternative:

    select
    deptno,
    max(sal)keep(dense_rank first order by dbms_random.value)sal
    from emp group by deptno;
    DEPTNO SAL
    ------ ----------
    10 5000
    20 2975
    30 950

  2. Adi Stav says:

    You are, of course, in a state of sin.

  3. prodlife says:

    Adi,
    I didn’t generate the random digits, so if anyone is in state of sin, it is Oracle (and I hope it is not).

  4. Jens Petersen says:

    I have experienced a similar bug using aggregate function based on an object type some time ago and opened a SR.
    Base bug# is 5386874 and will be fixed in 11.2

    HTH
    Jens

  5. chet says:

    i think that’s a tad disturbing… ;)

  6. prodlife says:

    Thanks, Jens!

  7. […] stuff now. Chen Shapira, the not-so-simple DBA puts her Statistics degree to good use by building a custom aggregation function that will return a random salary using Oracle’s Data Cartridge extensibility features. “The main challenge was to make […]


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 3,114 other followers