Launchpad itself

Code review comment for lp://staging/~gmb/launchpad/stored-proc-for-bug-heat-bug-582195

stored-proc-for-bug-heat-bug-582195
Merge into db-devel

Revision history for this message

Stuart Bishop (stub) wrote on 2010-05-27:

On Wed, May 26, 2010 at 10:26 PM, Graham Binns <email address hidden> wrote:

> === added file 'database/schema/patch-2207-99-0.sql'

There is no need for a database patch - the stored procedure should be
added to trusted.sql (which is becoming unwieldy - I think I need to
split it up).

You can do this in a single query rather than one per task:

SELECT MAX(GREATEST(Product.max_bug_heat, Distribution.max_bug_heat))
FROM BugTask
LEFT OUTER JOIN ProductSeries ON BugTask.productseries = ProductSeries.id
LEFT OUTER JOIN Product ON (
    BugTask.product = Product.id
    OR ProductSeries.product = Product.id)
LEFT OUTER JOIN DistroSeries ON BugTask.distroseries = DistroSeries.id
LEFT OUTER JOIN Distribution ON (
    BugTask.distribution = Distribution.id
    OR DistroSeries.distribution = Distribution.id)
WHERE
    BugTask.bug = 1

> + # It would be nice to be able to just SELECT * here, but we need to
> + # format the timestamps so that datetime.strptime() won't choke on
> + # them.

Using epoch time would avoid the need for parsing strings (use
'EXTRACT(epoch FROM datecreated)' instead of 'TO_CHAR(...)').
It probably doesn't save any significant time though.

> + if bug_data.nrows() == 0:
> + return 0

I'd return None or raise an exception here - the bug doesn't exist so
the heat is undefined.

> + bug = bug_data[0]
> + if bug['duplicateof'] is not None:
> + return 0

Again, None or the heat of the master bug. We should catch cases where
we are attempting to calculate the bug heat of non-existent or
duplicate bugs.

> + if bug['private']:
> + heat['privacy'] = BugHeatConstants.PRIVACY
> + if bug['security_related']:
> + heat['security'] = BugHeatConstants.SECURITY
> +
> + # Get the heat from subscribers.
> + sub_count = plpy.execute("""
> + SELECT bug, COUNT(id) AS sub_count
> + FROM BugSubscription
> + WHERE bug = %s
> + GROUP BY bug
> + """ % bug_id)

This is unnecessarily complex. The simpler query is just "SELECT
COUNT(*) FROM BugSubscription WHERE bug=%s".

> + if sub_count.nrows() > 0:
> + heat['subscribers'] = (
> + BugHeatConstants.SUBSCRIBER * sub_count[0]['sub_count'])

You don't need the nrows() guard, as the query will always return one row.

> + # Get the heat from subscribers via duplicates.
> + subs_from_dupes = plpy.execute("""
> + SELECT bug.duplicateof AS dupe_of,
> + COUNT(bugsubscription.id) AS dupe_sub_count
> + FROM bugsubscription, bug
> + WHERE bug.duplicateof = %s
> + AND Bugsubscription.bug = bug.id
> + GROUP BY Bug.duplicateof""" % bug_id)
> + if subs_from_dupes.nrows() > 0:
> + heat['subcribers_from_dupes'] = (
> + BugHeatConstants.SUBSCRIBER * sub_count[0]['dupe_sub_count'])

Similar here. nrows() guard is unnecessary. The query is also overly complex:

SELECT COUNT(*) FROM BugSubscription, Bug WHERE Bug.duplicateof = %s
AND BugSubscription.bug = Bug.id

I notice that if a person is subscribed to both a duplicate and the
master job, they are counted twice. If this is not correct, we can
collapse the above two sections into one using the query:

SELECT COUNT(DISTINCT BugSubscription.person)
FROM BugSubscription, Bug
WHERE Bug.id = BugSubscription.bug
AND (Bug.id = %s OR Bug.duplicateof = %s)

If you want to expand teams to get the genuine number of people subscribed:

SELECT COUNT(DISTINCT TeamParticipation.person)
FROM TeamParticipation, BugSubscription, Bug
WHERE
    TeamParticipation.team = BugSubscription.person
    AND Bug.id = BugSubscription.bug
    AND (Bug.id = %s OR Bug.duplicateof = %s)

Bug #1 for instance has 313 subscribers without expanding the teams,
or 678 if you do expand the teams.

> + # Bugs decay over time. Every day the bug isn't touched its heat
> + # decreases by 1%.
> + pg_datetime_fmt = "%Y-%m-%d %H:%M:%S"
> + date_last_updated = datetime.strptime(
> + bug['formatted_date_last_updated'], pg_datetime_fmt)
> + days_since_last_update = (datetime.utcnow() - date_last_updated).days
> + total_heat = int(total_heat * (0.99 ** days_since_last_update))

Yer... this would be simpler and clearer using epoch times.

> +INSERT INTO LaunchpadDatabaseRevision VALUES (2207, 99, 0);

As before, this should go into trusted.sql rather than a DB patch. We
get to treat stored procedures as any other code, unlike database
patches which cannot be reverted once applied.

> -Sometimes, when a bug changes, we want to see the changes reflected in the bug's
> -heat value immidiately, without waiting for heat to be recalculated. Currently
> -we adjust heat immidiately for bug privacy and security.

immediately, not immidiately.

> + def updateHeat(self):
> + """See `IBug`."""
> + # We need to flush the store first to ensure that changes are
> + # reflected in the new bug heat total.
> + store = Store.of(self)
> + store.flush()

You sure about that? Storm should be applying updates in order, so
when calculate_bug_heat is actually executed (at commit time or when
heat is next accessed) changes will be implicitly flushed.

--
Stuart Bishop <email address hidden>
http://www.stuartbishop.net/

On Wed, May 26, 2010 at 10:26 PM, Graham Binns <graham@canonical.com> wrote:

> === added file 'database/schema/patch-2207-99-0.sql'

There is no need for a database patch - the stored procedure should be
added to trusted.sql (which is becoming unwieldy - I think I need to
split it up).

> +CREATE OR REPLACE FUNCTION calculate_bug_heat(bug_id integer) RETURNS integer
> +LANGUAGE plpythonu AS $$
> +    from datetime import datetime
> +
> +    class BugHeatConstants:
> +        PRIVACY = 150
> +        SECURITY = 250
> +        DUPLICATE = 6
> +        AFFECTED_USER = 4
> +        SUBSCRIBER = 2
> +
> +
> +    def get_max_heat_for_bug(bug_id):
> +        bug_tasks = plpy.execute(
> +            "SELECT * FROM BugTask WHERE bug = %s" % bug_id)
> +
> +        max_heats = []
> +        for bug_task in bug_tasks:
> +            if bug_task['product'] is not None:
> +                product = plpy.execute(
> +                    "SELECT max_bug_heat FROM Product WHERE id = %s" %
> +                    bug_task['product'])[0]
> +                max_heats.append(product['max_bug_heat'])
> +            elif bug_task['distribution']:
> +                distribution = plpy.execute(
> +                    "SELECT max_bug_heat FROM Distribution WHERE id = %s" %
> +                    bug_task['distribution'])[0]
> +                max_heats.append(distribution['max_bug_heat'])
> +            elif bug_task['productseries'] is not None:
> +                product_series = plpy.execute("""
> +                    SELECT Product.max_bug_heat
> +                      FROM ProductSeries, Product
> +                     WHERE ProductSeries.Product = Product.id
> +                       AND ProductSeries.id = %s"""%
> +                    bug_task['productseries'])[0]
> +                max_heats.append(product_series['max_bug_heat'])
> +            elif bug_task['distroseries']:
> +                distro_series = plpy.execute("""
> +                    SELECT Distribution.max_bug_heat
> +                      FROM DistroSeries, Distribution
> +                     WHERE DistroSeries.Distribution = Distribution.id
> +                       AND DistroSeries.id = %s"""%
> +                    bug_task['distroseries'])[0]
> +                max_heats.append(distro_series['max_bug_heat'])
> +            else:
> +                pass
> +
> +        return max(max_heats)

You can do this in a single query rather than one per task:

> +    # It would be nice to be able to just SELECT * here, but we need to
> +    # format the timestamps so that datetime.strptime() won't choke on
> +    # them.

Using epoch time would avoid the need for parsing strings (use
'EXTRACT(epoch FROM datecreated)' instead of 'TO_CHAR(...)').
It probably doesn't save any significant time though.

> +    if bug_data.nrows() == 0:
> +        return 0

I'd return None or raise an exception here - the bug doesn't exist so
the heat is undefined.

> +    bug = bug_data[0]
> +    if bug['duplicateof'] is not None:
> +        return 0

Again, None or the heat of the master bug. We should catch cases where
we are attempting to calculate the bug heat of non-existent or
duplicate bugs.

> +    if bug['private']:
> +        heat['privacy'] = BugHeatConstants.PRIVACY
> +    if bug['security_related']:
> +        heat['security'] = BugHeatConstants.SECURITY
> +
> +    # Get the heat from subscribers.
> +    sub_count = plpy.execute("""
> +        SELECT bug, COUNT(id) AS sub_count
> +            FROM BugSubscription
> +            WHERE bug = %s
> +            GROUP BY bug
> +            """ % bug_id)

This is unnecessarily complex. The simpler query is just "SELECT
COUNT(*) FROM BugSubscription WHERE bug=%s".

> +    if sub_count.nrows() > 0:
> +        heat['subscribers'] = (
> +            BugHeatConstants.SUBSCRIBER * sub_count[0]['sub_count'])

You don't need the nrows() guard, as the query will always return one row.

> +    # Get the heat from subscribers via duplicates.
> +    subs_from_dupes = plpy.execute("""
> +        SELECT bug.duplicateof AS dupe_of,
> +                COUNT(bugsubscription.id) AS dupe_sub_count
> +        FROM bugsubscription, bug
> +        WHERE   bug.duplicateof = %s
> +            AND Bugsubscription.bug = bug.id
> +        GROUP BY Bug.duplicateof""" % bug_id)
> +    if subs_from_dupes.nrows() > 0:
> +        heat['subcribers_from_dupes'] = (
> +            BugHeatConstants.SUBSCRIBER * sub_count[0]['dupe_sub_count'])

Similar here. nrows() guard is unnecessary. The query is also overly complex:

SELECT COUNT(*) FROM BugSubscription, Bug WHERE Bug.duplicateof = %s
AND BugSubscription.bug = Bug.id

I notice that if a person is subscribed to both a duplicate and the
master job, they are counted twice. If this is not correct, we can
collapse the above two sections into one using the query:

SELECT COUNT(DISTINCT BugSubscription.person)
FROM BugSubscription, Bug
WHERE Bug.id = BugSubscription.bug
    AND (Bug.id = %s OR Bug.duplicateof = %s)

If you want to expand teams to get the genuine number of people subscribed:

Bug #1 for instance has 313 subscribers without expanding the teams,
or 678 if you do expand the teams.

> +    # Bugs decay over time. Every day the bug isn't touched its heat
> +    # decreases by 1%.
> +    pg_datetime_fmt = "%Y-%m-%d %H:%M:%S"
> +    date_last_updated = datetime.strptime(
> +        bug['formatted_date_last_updated'], pg_datetime_fmt)
> +    days_since_last_update = (datetime.utcnow() - date_last_updated).days
> +    total_heat = int(total_heat * (0.99 ** days_since_last_update))

Yer... this would be simpler and clearer using epoch times.

> +INSERT INTO LaunchpadDatabaseRevision VALUES (2207, 99, 0);

As before, this should go into trusted.sql rather than a DB patch. We
get to treat stored procedures as any other code, unlike database
patches which cannot be reverted once applied.

immediately, not immidiately.

> +    def updateHeat(self):
> +        """See `IBug`."""
> +        # We need to flush the store first to ensure that changes are
> +        # reflected in the new bug heat total.
> +        store = Store.of(self)
> +        store.flush()

You sure about that? Storm should be applying updates in order, so
when calculate_bug_heat is actually executed (at commit time or when
heat is next accessed) changes will be implicitly flushed.

-- 
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/

« Back to merge proposal